[2025-11-13 08:04:09,158][mllm.models.large_language_model_local][INFO] - Initializing adapter 'agent_adapter': no initial weights provided or found; starting from scratch.
[2025-11-13 08:04:09,963][mllm.models.adapter_training_wrapper][INFO] - Adapter 'agent_adapter': initialized with fresh weights (no initial weights found).
[2025-11-13 08:04:09,970][mllm.models.large_language_model_local][INFO] - Initializing adapter 'critic_adapter': no initial weights provided or found; starting from scratch.
[2025-11-13 08:04:11,029][mllm.models.adapter_training_wrapper][INFO] - Adapter 'critic_adapter': initialized with fresh weights (no initial weights found).
[2025-11-13 08:06:20,547][__main__][INFO] - Starting iteration 0.
[2025-11-13 08:06:20,551][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1.
[2025-11-13 08:06:20,552][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:06:25,146][__main__][INFO] - Number of regex retries in iteration 0: 0
[2025-11-13 08:06:25,147][__main__][INFO] - agents played in iteration 0 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:06:25,563][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 37.45%, Block Peak % of device VRAM: 18.68%, ΔTime: 00:00:00
[2025-11-13 08:06:25,605][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 37.45%, Block Peak % of device VRAM: 18.68%, ΔTime: 00:00:00
[2025-11-13 08:06:25,645][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 37.45%, Block Peak % of device VRAM: 18.68%, ΔTime: 00:00:00
[2025-11-13 08:06:25,684][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 37.45%, Block Peak % of device VRAM: 18.68%, ΔTime: 00:00:00
[2025-11-13 08:06:25,685][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:06:25,685][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:06:26,337][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:06:26,914][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:06:27,238][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:06:27,560][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:06:27,882][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:06:28,206][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:06:28,528][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:06:28,849][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:06:29,174][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:06:29,496][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:06:29,822][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:06:30,144][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:06:30,467][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:06:30,790][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:06:31,114][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:06:31,436][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:06:31,759][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:06:32,083][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:06:32,407][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:06:32,729][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:06:33,052][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:06:33,376][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:06:33,697][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:06:34,020][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:06:34,342][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:06:34,664][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:06:34,988][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:06:35,312][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:06:35,634][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:06:35,956][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:06:36,278][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:06:36,602][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:06:36,924][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:06:37,725][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 4.58%, Current % of VRAM taken: 42.03%, Block Peak % of device VRAM: 25.21%, ΔTime: 00:00:11
[2025-11-13 08:06:38,524][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:06:38,526][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:06:38,528][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:06:39,742][__main__][INFO] - Iteration 1 took 19s (23.94% Gen, 69.72% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 56m 44s. Estimated total time: 15h 59m 34s. Time estimates for 10 more iterations: 3m 11s, 100 more iterations: 31m 59s, 500 more iterations: 2h 39m 55s.
[2025-11-13 08:06:39,744][__main__][INFO] - Starting iteration 1.
[2025-11-13 08:06:39,747][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1.
[2025-11-13 08:06:39,748][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:06:43,405][__main__][INFO] - Number of regex retries in iteration 1: 0
[2025-11-13 08:06:43,406][__main__][INFO] - agents played in iteration 1 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:06:43,862][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:06:43,904][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:06:43,944][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:06:43,983][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:06:43,984][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:06:43,984][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:06:44,666][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:06:44,958][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:06:45,282][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:06:45,607][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:06:45,934][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:06:46,261][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:06:46,583][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:06:46,906][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:06:47,230][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:06:47,556][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:06:47,879][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:06:48,205][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:06:48,529][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:06:48,852][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:06:49,174][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:06:49,498][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:06:49,825][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:06:50,148][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:06:50,471][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:06:50,794][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:06:51,121][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:06:51,445][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:06:51,775][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:06:52,103][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:06:52,426][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:06:52,757][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:06:53,083][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:06:53,407][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:06:53,733][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:06:54,059][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:06:54,384][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:06:54,707][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:06:55,033][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:06:55,713][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:06:56,425][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:06:56,427][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:06:56,428][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:06:57,430][__main__][INFO] - Iteration 2 took 17s (20.68% Gen, 73.64% Train). Generation: 3s, Training: 13s. Estimated remaining time: 14h 41m 3s. Estimated total time: 14h 44m 11s. Time estimates for 10 more iterations: 2m 56s, 100 more iterations: 29m 28s, 500 more iterations: 2h 27m 21s.
[2025-11-13 08:06:57,432][__main__][INFO] - Starting iteration 2.
[2025-11-13 08:06:57,435][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1.
[2025-11-13 08:06:57,436][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:07:01,163][__main__][INFO] - Number of regex retries in iteration 2: 0
[2025-11-13 08:07:01,164][__main__][INFO] - agents played in iteration 2 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:07:01,582][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:07:01,622][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:07:01,661][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:07:01,700][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:07:01,700][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:07:01,701][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:07:02,398][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:07:02,694][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:07:03,023][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:07:03,346][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:07:03,670][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:07:03,996][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:07:04,321][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:07:04,648][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:07:04,974][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:07:05,301][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:07:05,629][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:07:05,953][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:07:06,276][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:07:06,601][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:07:06,929][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:07:07,252][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:07:07,575][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:07:07,899][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:07:08,224][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:07:08,547][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:07:08,872][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:07:09,198][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:07:09,525][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:07:09,851][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:07:10,176][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:07:10,503][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:07:10,826][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:07:11,158][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:07:11,482][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:07:11,807][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:07:12,129][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:07:12,455][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:07:12,779][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:07:13,445][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:07:14,181][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:07:14,183][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:07:14,184][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:07:15,177][__main__][INFO] - Iteration 3 took 17s (21.01% Gen, 73.39% Train). Generation: 3s, Training: 13s. Estimated remaining time: 14h 43m 40s. Estimated total time: 14h 47m 6s. Time estimates for 10 more iterations: 2m 57s, 100 more iterations: 29m 34s, 500 more iterations: 2h 27m 51s.
[2025-11-13 08:07:15,179][__main__][INFO] - Starting iteration 3.
[2025-11-13 08:07:15,182][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1.
[2025-11-13 08:07:15,183][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:07:18,882][__main__][INFO] - Number of regex retries in iteration 3: 0
[2025-11-13 08:07:18,883][__main__][INFO] - agents played in iteration 3 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:07:19,303][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:07:19,345][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:07:19,384][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:07:19,423][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:07:19,423][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:07:19,424][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:07:20,117][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:07:20,412][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:07:20,741][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:07:21,069][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:07:21,406][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:07:21,733][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:07:22,061][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:07:22,388][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:07:22,713][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:07:23,040][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:07:23,366][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:07:23,692][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:07:24,015][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:07:24,341][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:07:24,669][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:07:25,000][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:07:25,331][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:07:25,663][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:07:25,994][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:07:26,320][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:07:26,646][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:07:26,970][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:07:27,296][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:07:27,622][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:07:27,947][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:07:28,278][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:07:28,610][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:07:28,941][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:07:29,266][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:07:29,593][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:07:29,920][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:07:30,246][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:07:30,571][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:07:31,286][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:07:32,046][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:07:32,048][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:07:32,049][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:07:33,072][__main__][INFO] - Iteration 4 took 17s (20.68% Gen, 73.59% Train). Generation: 3s, Training: 13s. Estimated remaining time: 14h 50m 47s. Estimated total time: 14h 54m 31s. Time estimates for 10 more iterations: 2m 58s, 100 more iterations: 29m 49s, 500 more iterations: 2h 29m 5s.
[2025-11-13 08:07:33,074][__main__][INFO] - Starting iteration 4.
[2025-11-13 08:07:33,077][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1.
[2025-11-13 08:07:33,077][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:07:36,712][__main__][INFO] - Number of regex retries in iteration 4: 0
[2025-11-13 08:07:36,712][__main__][INFO] - agents played in iteration 4 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:07:37,138][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:07:37,179][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:07:37,217][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:07:37,257][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:07:37,257][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:07:37,258][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:07:37,965][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:07:38,261][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:07:38,588][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:07:38,912][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:07:39,236][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:07:39,560][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:07:39,883][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:07:40,206][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:07:40,533][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:07:40,860][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:07:41,186][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:07:41,514][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:07:41,838][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:07:42,168][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:07:42,501][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:07:42,829][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:07:43,154][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:07:43,479][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:07:43,806][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:07:44,131][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:07:44,456][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:07:44,782][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:07:45,108][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:07:45,439][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:07:45,763][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:07:46,088][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:07:46,416][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:07:46,742][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:07:47,067][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:07:47,390][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:07:47,715][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:07:48,039][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:07:48,364][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:07:49,052][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:07:49,795][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:07:49,797][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:07:49,798][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:07:50,798][__main__][INFO] - Iteration 5 took 17s (20.51% Gen, 73.84% Train). Generation: 3s, Training: 13s. Estimated remaining time: 14h 42m 3s. Estimated total time: 14h 46m 5s. Time estimates for 10 more iterations: 2m 57s, 100 more iterations: 29m 32s, 500 more iterations: 2h 27m 40s.
[2025-11-13 08:07:50,800][__main__][INFO] - Starting iteration 5.
[2025-11-13 08:07:50,803][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1.
[2025-11-13 08:07:50,804][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:07:54,444][__main__][INFO] - Number of regex retries in iteration 5: 0
[2025-11-13 08:07:54,445][__main__][INFO] - agents played in iteration 5 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:07:54,867][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:07:54,909][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:07:54,949][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:07:54,989][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:07:54,989][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:07:54,990][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:07:55,686][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:07:55,983][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:07:56,308][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:07:56,635][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:07:56,961][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:07:57,286][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:07:57,609][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:07:57,936][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:07:58,261][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:07:58,585][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:07:58,910][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:07:59,234][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:07:59,559][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:07:59,886][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:08:00,211][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:08:00,535][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:08:00,858][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:08:01,181][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:08:01,506][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:08:01,830][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:08:02,154][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:08:02,477][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:08:02,801][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:08:03,126][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:08:03,451][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:08:03,777][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:08:04,100][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:08:04,425][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:08:04,749][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:08:05,072][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:08:05,399][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:08:05,725][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:08:06,049][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:08:06,745][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:08:07,489][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:08:07,491][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:08:07,492][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:08:08,550][__main__][INFO] - Iteration 6 took 17s (20.52% Gen, 73.51% Train). Generation: 3s, Training: 13s. Estimated remaining time: 14h 43m 4s. Estimated total time: 14h 47m 23s. Time estimates for 10 more iterations: 2m 57s, 100 more iterations: 29m 34s, 500 more iterations: 2h 27m 53s.
[2025-11-13 08:08:08,552][__main__][INFO] - Starting iteration 6.
[2025-11-13 08:08:08,557][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1.
[2025-11-13 08:08:08,557][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:08:12,236][__main__][INFO] - Number of regex retries in iteration 6: 0
[2025-11-13 08:08:12,237][__main__][INFO] - agents played in iteration 6 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:08:12,663][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:08:12,702][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:08:12,742][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:08:12,781][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:08:12,781][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:08:12,782][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:08:13,497][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:08:13,792][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:08:14,116][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:08:14,444][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:08:14,771][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:08:15,096][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:08:15,425][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:08:15,748][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:08:16,080][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:08:16,407][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:08:16,735][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:08:17,060][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:08:17,387][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:08:17,711][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:08:18,040][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:08:18,370][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:08:18,694][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:08:19,023][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:08:19,354][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:08:19,683][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:08:20,010][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:08:20,336][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:08:20,665][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:08:20,989][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:08:21,320][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:08:21,647][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:08:21,971][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:08:22,298][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:08:22,622][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:08:22,948][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:08:23,273][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:08:23,600][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:08:23,925][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:08:24,640][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:08:25,398][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:08:25,400][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:08:25,401][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:08:26,402][__main__][INFO] - Iteration 7 took 17s (20.61% Gen, 73.77% Train). Generation: 3s, Training: 13s. Estimated remaining time: 14h 47m 41s. Estimated total time: 14h 52m 18s. Time estimates for 10 more iterations: 2m 58s, 100 more iterations: 29m 44s, 500 more iterations: 2h 28m 43s.
[2025-11-13 08:08:26,404][__main__][INFO] - Starting iteration 7.
[2025-11-13 08:08:26,408][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1.
[2025-11-13 08:08:26,408][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:08:30,065][__main__][INFO] - Number of regex retries in iteration 7: 0
[2025-11-13 08:08:30,066][__main__][INFO] - agents played in iteration 7 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:08:30,492][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:08:30,534][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:08:30,573][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:08:30,612][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:08:30,613][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:08:30,614][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:08:31,328][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:08:31,625][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:08:31,952][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:08:32,279][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:08:32,608][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:08:32,931][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:08:33,256][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:08:33,581][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:08:33,906][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:08:34,233][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:08:34,557][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:08:34,883][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:08:35,207][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:08:35,538][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:08:35,866][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:08:36,192][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:08:36,517][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:08:36,842][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:08:37,166][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:08:37,489][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:08:37,812][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:08:38,136][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:08:38,461][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:08:38,783][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:08:39,112][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:08:39,439][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:08:39,764][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:08:40,088][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:08:40,413][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:08:40,739][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:08:41,064][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:08:41,388][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:08:41,713][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:08:42,413][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:08:43,160][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:08:43,162][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:08:43,164][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:08:44,172][__main__][INFO] - Iteration 8 took 17s (20.59% Gen, 73.73% Train). Generation: 3s, Training: 13s. Estimated remaining time: 14h 43m 19s. Estimated total time: 14h 48m 14s. Time estimates for 10 more iterations: 2m 57s, 100 more iterations: 29m 36s, 500 more iterations: 2h 28m 2s.
[2025-11-13 08:08:44,174][__main__][INFO] - Starting iteration 8.
[2025-11-13 08:08:44,178][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1.
[2025-11-13 08:08:44,178][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:08:47,921][__main__][INFO] - Number of regex retries in iteration 8: 0
[2025-11-13 08:08:47,922][__main__][INFO] - agents played in iteration 8 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:08:48,352][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:08:48,391][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:08:48,430][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:08:48,469][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:08:48,470][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:08:48,470][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:08:49,163][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:08:49,458][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:08:49,784][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:08:50,112][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:08:50,435][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:08:50,762][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:08:51,089][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:08:51,412][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:08:51,736][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:08:52,060][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:08:52,388][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:08:52,713][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:08:53,037][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:08:53,362][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:08:53,687][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:08:54,014][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:08:54,340][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:08:54,667][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:08:54,992][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:08:55,317][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:08:55,642][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:08:55,967][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:08:56,291][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:08:56,619][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:08:56,944][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:08:57,268][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:08:57,592][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:08:57,917][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:08:58,244][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:08:58,570][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:08:58,895][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:08:59,221][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:08:59,546][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:09:00,235][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:09:00,977][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:09:00,979][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:09:00,980][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:09:02,213][__main__][INFO] - Iteration 9 took 18s (20.75% Gen, 72.40% Train). Generation: 3s, Training: 13s. Estimated remaining time: 14h 56m 36s. Estimated total time: 15h 1m 49s. Time estimates for 10 more iterations: 3m 0s, 100 more iterations: 30m 3s, 500 more iterations: 2h 30m 18s.
[2025-11-13 08:09:02,215][__main__][INFO] - Starting iteration 9.
[2025-11-13 08:09:02,218][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1.
[2025-11-13 08:09:02,219][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:09:05,782][__main__][INFO] - Number of regex retries in iteration 9: 0
[2025-11-13 08:09:05,782][__main__][INFO] - agents played in iteration 9 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:09:06,226][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:09:06,267][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:09:06,307][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:09:06,347][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:09:06,348][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:09:06,348][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:09:07,048][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:09:07,347][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:09:07,679][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:09:08,007][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:09:08,335][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:09:08,659][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:09:08,984][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:09:09,309][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:09:09,634][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:09:09,957][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:09:10,283][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:09:10,607][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:09:10,931][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:09:11,257][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:09:11,583][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:09:11,907][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:09:12,234][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:09:12,565][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:09:12,889][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:09:13,215][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:09:13,546][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:09:13,871][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:09:14,198][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:09:14,525][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:09:14,853][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:09:15,180][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:09:15,508][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:09:15,832][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:09:16,157][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:09:16,483][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:09:16,807][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:09:17,132][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:09:17,459][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:09:18,149][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:09:18,891][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:09:18,893][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:09:18,895][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:09:20,113][__main__][INFO] - Iteration 10 took 17s (19.91% Gen, 73.27% Train). Generation: 3s, Training: 13s. Estimated remaining time: 14h 49m 17s. Estimated total time: 14h 54m 48s. Time estimates for 10 more iterations: 2m 58s, 100 more iterations: 29m 49s, 500 more iterations: 2h 29m 8s.
[2025-11-13 08:09:20,115][__main__][INFO] - Starting iteration 10.
[2025-11-13 08:09:20,119][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1.
[2025-11-13 08:09:20,119][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:09:23,718][__main__][INFO] - Number of regex retries in iteration 10: 0
[2025-11-13 08:09:23,719][__main__][INFO] - agents played in iteration 10 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:09:24,142][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:09:24,182][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:09:24,221][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:09:24,260][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:09:24,261][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:09:24,261][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:09:24,961][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:09:25,256][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:09:25,586][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:09:25,910][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:09:26,236][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:09:26,562][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:09:26,887][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:09:27,213][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:09:27,537][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:09:27,861][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:09:28,185][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:09:28,510][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:09:28,835][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:09:29,161][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:09:29,489][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:09:29,814][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:09:30,142][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:09:30,467][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:09:30,792][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:09:31,117][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:09:31,442][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:09:31,767][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:09:32,093][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:09:32,418][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:09:32,741][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:09:33,066][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:09:33,392][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:09:33,718][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:09:34,046][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:09:34,374][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:09:34,697][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:09:35,021][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:09:35,349][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:09:36,225][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:09:36,989][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:09:36,990][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:09:36,992][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:09:39,079][__main__][INFO] - Iteration 11 took 18s (18.98% Gen, 70.00% Train). Generation: 3s, Training: 13s. Estimated remaining time: 15h 42m 14s. Estimated total time: 15h 48m 4s. Time estimates for 10 more iterations: 3m 9s, 100 more iterations: 31m 36s, 500 more iterations: 2h 38m 0s.
[2025-11-13 08:09:39,081][__main__][INFO] - Starting iteration 11.
[2025-11-13 08:09:39,084][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1.
[2025-11-13 08:09:39,085][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:09:43,184][__main__][INFO] - Number of regex retries in iteration 11: 0
[2025-11-13 08:09:43,185][__main__][INFO] - agents played in iteration 11 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:09:43,615][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:09:43,656][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:09:43,696][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:09:43,735][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:09:43,736][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:09:43,737][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:09:44,440][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:09:44,736][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:09:45,063][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:09:45,388][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:09:45,713][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:09:46,039][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:09:46,365][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:09:46,689][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:09:47,017][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:09:47,346][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:09:47,671][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:09:47,997][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:09:48,323][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:09:48,647][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:09:48,973][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:09:49,303][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:09:49,627][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:09:49,952][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:09:50,276][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:09:50,601][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:09:50,926][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:09:51,250][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:09:51,578][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:09:51,902][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:09:52,227][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:09:52,553][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:09:52,879][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:09:53,203][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:09:53,528][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:09:53,853][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:09:54,178][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:09:54,502][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:09:54,827][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:09:55,511][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:09:56,235][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:09:56,237][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:09:56,239][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:09:57,279][__main__][INFO] - Iteration 12 took 18s (22.53% Gen, 71.74% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 3m 37s. Estimated total time: 15h 9m 45s. Time estimates for 10 more iterations: 3m 1s, 100 more iterations: 30m 19s, 500 more iterations: 2h 31m 37s.
[2025-11-13 08:09:57,281][__main__][INFO] - Starting iteration 12.
[2025-11-13 08:09:57,285][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1.
[2025-11-13 08:09:57,286][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:10:01,412][__main__][INFO] - Number of regex retries in iteration 12: 0
[2025-11-13 08:10:01,412][__main__][INFO] - agents played in iteration 12 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:10:01,844][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:10:01,885][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:10:01,924][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:10:01,963][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:10:01,964][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:10:01,964][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:10:02,682][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:10:02,979][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:10:03,304][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:10:03,629][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:10:03,957][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:10:04,283][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:10:04,610][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:10:04,940][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:10:05,268][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:10:05,594][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:10:05,922][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:10:06,250][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:10:06,579][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:10:06,908][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:10:07,237][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:10:07,570][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:10:07,896][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:10:08,227][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:10:08,555][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:10:08,883][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:10:09,210][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:10:09,539][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:10:09,868][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:10:10,195][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:10:10,521][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:10:10,854][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:10:11,183][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:10:11,516][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:10:11,843][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:10:12,172][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:10:12,500][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:10:12,833][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:10:13,162][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:10:13,885][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:10:14,650][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:10:14,651][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:10:14,653][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:10:15,665][__main__][INFO] - Iteration 13 took 18s (22.45% Gen, 72.03% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 12m 38s. Estimated total time: 15h 19m 5s. Time estimates for 10 more iterations: 3m 3s, 100 more iterations: 30m 38s, 500 more iterations: 2h 33m 10s.
[2025-11-13 08:10:15,668][__main__][INFO] - Starting iteration 13.
[2025-11-13 08:10:15,671][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1.
[2025-11-13 08:10:15,671][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:10:19,629][__main__][INFO] - Number of regex retries in iteration 13: 0
[2025-11-13 08:10:19,629][__main__][INFO] - agents played in iteration 13 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:10:20,065][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:10:20,104][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:10:20,144][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:10:20,183][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:10:20,184][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:10:20,184][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:10:20,892][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:10:21,188][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:10:21,515][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:10:21,841][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:10:22,166][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:10:22,491][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:10:22,817][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:10:23,141][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:10:23,466][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:10:23,792][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:10:24,119][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:10:24,443][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:10:24,769][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:10:25,094][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:10:25,419][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:10:25,743][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:10:26,068][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:10:26,392][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:10:26,718][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:10:27,044][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:10:27,368][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:10:27,694][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:10:28,020][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:10:28,345][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:10:28,673][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:10:28,998][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:10:29,325][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:10:29,649][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:10:29,975][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:10:30,299][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:10:30,624][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:10:30,949][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:10:31,272][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:10:31,969][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:10:32,692][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:10:32,694][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:10:32,695][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:10:33,794][__main__][INFO] - Iteration 14 took 18s (21.84% Gen, 72.09% Train). Generation: 3s, Training: 13s. Estimated remaining time: 14h 59m 27s. Estimated total time: 15h 6m 11s. Time estimates for 10 more iterations: 3m 1s, 100 more iterations: 30m 12s, 500 more iterations: 2h 31m 1s.
[2025-11-13 08:10:33,796][__main__][INFO] - Starting iteration 14.
[2025-11-13 08:10:33,799][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1.
[2025-11-13 08:10:33,800][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:10:37,810][__main__][INFO] - Number of regex retries in iteration 14: 0
[2025-11-13 08:10:37,810][__main__][INFO] - agents played in iteration 14 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:10:38,256][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:10:38,297][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:10:38,337][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:10:38,377][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:10:38,378][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:10:38,378][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:10:39,088][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:10:39,384][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:10:39,710][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:10:40,034][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:10:40,360][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:10:40,684][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:10:41,011][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:10:41,334][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:10:41,661][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:10:41,985][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:10:42,318][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:10:42,643][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:10:42,968][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:10:43,294][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:10:43,620][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:10:43,945][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:10:44,271][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:10:44,598][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:10:44,924][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:10:45,248][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:10:45,573][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:10:45,899][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:10:46,227][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:10:46,552][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:10:46,880][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:10:47,213][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:10:47,541][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:10:47,866][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:10:48,190][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:10:48,515][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:10:48,841][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:10:49,168][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:10:49,493][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:10:50,174][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:10:50,912][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:10:50,913][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:10:50,915][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:10:51,931][__main__][INFO] - Iteration 15 took 18s (22.12% Gen, 72.27% Train). Generation: 4s, Training: 13s. Estimated remaining time: 14h 59m 36s. Estimated total time: 15h 6m 38s. Time estimates for 10 more iterations: 3m 1s, 100 more iterations: 30m 13s, 500 more iterations: 2h 31m 6s.
[2025-11-13 08:10:51,933][__main__][INFO] - Starting iteration 15.
[2025-11-13 08:10:51,937][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1.
[2025-11-13 08:10:51,937][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:10:56,034][__main__][INFO] - Number of regex retries in iteration 15: 0
[2025-11-13 08:10:56,035][__main__][INFO] - agents played in iteration 15 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:10:56,470][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:10:56,512][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:10:56,552][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:10:56,592][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:10:56,593][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:10:56,593][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:10:57,303][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:10:57,602][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:10:57,928][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:10:58,252][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:10:58,577][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:10:58,901][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:10:59,225][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:10:59,551][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:10:59,880][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:11:00,205][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:11:00,529][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:11:00,857][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:11:01,182][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:11:01,514][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:11:01,838][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:11:02,163][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:11:02,486][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:11:02,810][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:11:03,134][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:11:03,459][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:11:03,783][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:11:04,106][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:11:04,431][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:11:04,755][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:11:05,081][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:11:05,406][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:11:05,731][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:11:06,063][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:11:06,390][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:11:06,715][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:11:07,042][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:11:07,366][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:11:07,694][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:11:08,374][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:11:09,100][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:11:09,101][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:11:09,103][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:11:10,194][__main__][INFO] - Iteration 16 took 18s (22.44% Gen, 71.58% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 5m 33s. Estimated total time: 15h 12m 54s. Time estimates for 10 more iterations: 3m 2s, 100 more iterations: 30m 25s, 500 more iterations: 2h 32m 9s.
[2025-11-13 08:11:10,196][__main__][INFO] - Starting iteration 16.
[2025-11-13 08:11:10,199][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1.
[2025-11-13 08:11:10,200][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:11:14,192][__main__][INFO] - Number of regex retries in iteration 16: 0 [2025-11-13 08:11:14,192][__main__][INFO] - agents played in iteration 16 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 08:11:14,624][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:11:14,664][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:11:14,704][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:11:14,743][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:11:14,744][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:11:14,744][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:11:15,464][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:11:15,762][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:11:16,087][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:11:16,415][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:11:16,741][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:11:17,070][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:11:17,401][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:11:17,727][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:11:18,051][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:11:18,376][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:11:18,704][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:11:19,029][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:11:19,353][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:11:19,678][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:11:20,003][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:11:20,327][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:11:20,654][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:11:20,980][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:11:21,305][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:11:21,632][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:11:21,960][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:11:22,287][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:11:22,615][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:11:22,940][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:11:23,266][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:11:23,592][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:11:23,917][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:11:24,241][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:11:24,567][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:11:24,892][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:11:25,218][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:11:25,544][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:11:25,869][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:11:26,575][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:11:27,318][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:11:27,320][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:11:27,321][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:11:28,451][__main__][INFO] - Iteration 17 took 18s (21.87% Gen, 71.93% Train). Generation: 3s, Training: 13s. Estimated remaining time: 15h 4m 58s. Estimated total time: 15h 12m 37s. Time estimates for 10 more iterations: 3m 2s, 100 more iterations: 30m 25s, 500 more iterations: 2h 32m 6s.
[2025-11-13 08:11:28,453][__main__][INFO] - Starting iteration 17.
[2025-11-13 08:11:28,457][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1.
[2025-11-13 08:11:28,458][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:11:32,532][__main__][INFO] - Number of regex retries in iteration 17: 0
[2025-11-13 08:11:32,533][__main__][INFO] - agents played in iteration 17 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:11:32,964][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:11:33,004][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:11:33,043][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:11:33,083][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:11:33,083][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:11:33,084][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:11:33,798][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:11:34,094][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:11:34,424][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:11:34,749][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:11:35,081][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:11:35,403][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:11:35,731][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:11:36,056][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:11:36,381][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:11:36,707][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:11:37,032][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:11:37,357][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:11:37,686][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:11:38,014][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:11:38,346][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:11:38,680][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:11:39,005][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:11:39,332][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:11:39,657][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:11:39,983][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:11:40,308][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:11:40,631][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:11:40,962][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:11:41,287][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:11:41,614][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:11:41,938][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:11:42,264][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:11:42,596][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:11:42,920][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:11:43,246][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:11:43,571][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:11:43,898][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:11:44,224][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:11:44,922][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:11:45,655][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:11:45,656][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:11:45,658][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:11:46,644][__main__][INFO] - Iteration 18 took 18s (22.41% Gen, 72.16% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 1m 25s. Estimated total time: 15h 9m 22s. Time estimates for 10 more iterations: 3m 1s, 100 more iterations: 30m 18s, 500 more iterations: 2h 31m 33s.
[2025-11-13 08:11:46,646][__main__][INFO] - Starting iteration 18.
[2025-11-13 08:11:46,649][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1.
[2025-11-13 08:11:46,650][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:11:50,636][__main__][INFO] - Number of regex retries in iteration 18: 0
[2025-11-13 08:11:50,637][__main__][INFO] - agents played in iteration 18 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:11:51,071][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:11:51,112][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:11:51,152][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:11:51,192][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:11:51,192][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:11:51,193][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:11:51,908][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:11:52,206][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:11:52,537][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:11:52,861][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:11:53,187][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:11:53,512][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:11:53,835][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:11:54,160][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:11:54,486][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:11:54,811][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:11:55,136][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:11:55,460][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:11:55,786][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:11:56,111][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:11:56,436][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:11:56,761][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:11:57,086][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:11:57,410][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:11:57,739][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:11:58,064][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:11:58,388][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:11:58,714][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:11:59,045][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:11:59,369][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:11:59,694][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:12:00,020][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:12:00,345][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:12:00,671][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:12:01,002][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:12:01,334][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:12:01,660][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:12:01,983][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:12:02,309][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:12:03,011][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:12:03,746][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:12:03,749][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:12:03,751][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:12:04,787][__main__][INFO] - Iteration 19 took 18s (21.98% Gen, 72.30% Train). Generation: 3s, Training: 13s. Estimated remaining time: 14h 58m 41s. Estimated total time: 15h 6m 56s. Time estimates for 10 more iterations: 3m 1s, 100 more iterations: 30m 13s, 500 more iterations: 2h 31m 9s.
[2025-11-13 08:12:04,789][__main__][INFO] - Starting iteration 19.
[2025-11-13 08:12:04,801][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1.
[2025-11-13 08:12:04,802][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:12:08,828][__main__][INFO] - Number of regex retries in iteration 19: 0
[2025-11-13 08:12:08,829][__main__][INFO] - agents played in iteration 19 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:12:09,265][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:12:09,306][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:12:09,346][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:12:09,386][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:12:09,386][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:12:09,386][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:12:10,107][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:12:10,404][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:12:10,730][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:12:11,062][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:12:11,387][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:12:11,712][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:12:12,036][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:12:12,361][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:12:12,688][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:12:13,013][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:12:13,336][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:12:13,662][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:12:13,987][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:12:14,314][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:12:14,641][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:12:14,966][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:12:15,290][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:12:15,616][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:12:15,941][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:12:16,268][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:12:16,596][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:12:16,923][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:12:17,248][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:12:17,572][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:12:17,900][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:12:18,227][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:12:18,559][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:12:18,886][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:12:19,212][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:12:19,543][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:12:19,870][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:12:20,195][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:12:20,522][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:12:21,229][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:12:21,960][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:12:21,961][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:12:21,962][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:12:22,959][__main__][INFO] - Iteration 20 took 18s (22.17% Gen, 72.33% Train). Generation: 4s, Training: 13s. Estimated remaining time: 14h 59m 22s. Estimated total time: 15h 7m 55s. Time estimates for 10 more iterations: 3m 1s, 100 more iterations: 30m 15s, 500 more iterations: 2h 31m 19s.
[2025-11-13 08:12:22,961][__main__][INFO] - Starting iteration 20.
[2025-11-13 08:12:22,965][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1.
[2025-11-13 08:12:22,965][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:12:27,048][__main__][INFO] - Number of regex retries in iteration 20: 0
[2025-11-13 08:12:27,049][__main__][INFO] - agents played in iteration 20 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:12:27,495][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:12:27,537][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:12:27,577][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:12:27,618][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:12:27,618][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:12:27,619][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:12:28,341][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:12:28,638][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:12:28,965][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:12:29,289][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:12:29,613][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:12:29,943][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:12:30,267][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:12:30,594][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:12:30,918][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:12:31,242][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:12:31,565][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:12:31,892][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:12:32,220][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:12:32,544][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:12:32,869][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:12:33,192][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:12:33,516][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:12:33,844][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:12:34,169][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:12:34,494][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:12:34,819][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:12:35,142][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:12:35,468][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:12:35,792][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:12:36,116][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:12:36,443][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:12:36,765][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:12:37,089][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:12:37,415][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:12:37,741][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:12:38,067][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:12:38,390][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:12:38,716][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:12:39,420][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:12:40,154][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:12:40,156][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:12:40,157][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:12:42,100][__main__][INFO] - Iteration 21 took 19s (21.34% Gen, 68.50% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 47m 54s. Estimated total time: 15h 56m 47s. Time estimates for 10 more iterations: 3m 11s, 100 more iterations: 31m 53s, 500 more iterations: 2h 39m 27s.
[2025-11-13 08:12:42,102][__main__][INFO] - Starting iteration 21.
[2025-11-13 08:12:42,105][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1.
[2025-11-13 08:12:42,106][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:12:46,354][__main__][INFO] - Number of regex retries in iteration 21: 0
[2025-11-13 08:12:46,355][__main__][INFO] - agents played in iteration 21 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:12:46,787][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:12:46,827][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:12:46,867][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:12:46,908][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:12:46,908][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:12:46,909][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:12:47,641][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:12:47,939][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:12:48,266][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:12:48,591][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:12:48,917][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:12:49,242][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:12:49,569][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:12:49,895][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:12:50,222][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:12:50,546][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:12:50,870][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:12:51,197][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:12:51,521][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:12:51,849][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:12:52,182][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:12:52,519][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:12:52,846][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:12:53,170][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:12:53,495][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:12:53,820][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:12:54,146][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:12:54,472][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:12:54,800][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:12:55,124][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:12:55,451][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:12:55,778][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:12:56,105][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:12:56,436][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:12:56,761][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:12:57,091][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:12:57,420][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:12:57,747][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:12:58,073][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
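The progress entries above (every 4th mini-batch, then a final token tally) suggest a gradient-accumulation loop that logs periodically. A hedged sketch of that loop shape, with the backward pass stubbed out since only the logging and token bookkeeping are visible in the log (function and field names are assumptions):

```python
import logging

logging.basicConfig(level=logging.INFO, format="[%(name)s][%(levelname)s] - %(message)s")
log = logging.getLogger("trainer_common")

def accumulate_policy_loss(mini_batches, log_every=4):
    """Walk the mini-batches, logging progress every `log_every` steps, and
    report the total number of tokens the accumulated loss covers."""
    total_tokens = 0
    for i, batch in enumerate(mini_batches):
        if i % log_every == 0:
            log.info("Processing mini-batch %d of %d", i, len(mini_batches))
        # The real trainer would run a forward/backward pass here and keep the
        # gradients; this sketch only tallies tokens to mirror the final line.
        total_tokens += batch["num_tokens"]
    log.info("Accumulated the policy gradient loss for %d tokens.", total_tokens)
    return total_tokens
```

With 128 mini-batches of 30 tokens each this reproduces the "3840 tokens" figure seen in the run, though the actual per-batch token counts are not recoverable from the log.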
[2025-11-13 08:12:58,784][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:12:59,534][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:12:59,536][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:12:59,538][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:13:00,536][__main__][INFO] - Iteration 22 took 18s (23.05% Gen, 71.52% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 12m 24s. Estimated total time: 15h 21m 36s. Time estimates for 10 more iterations: 3m 4s, 100 more iterations: 30m 43s, 500 more iterations: 2h 33m 36s.
[2025-11-13 08:13:00,538][__main__][INFO] - Starting iteration 22.
[2025-11-13 08:13:00,541][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1.
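The "Estimated remaining time / total time / N more iterations" entries read like a simple extrapolation from mean per-iteration wall time. A minimal sketch under that assumption (the function name, averaging scheme, and field names are all hypothetical, not the project's code):

```python
def eta_report(iterations_done: int, total_iterations: int,
               elapsed_seconds: float) -> dict:
    """Extrapolate remaining and total run time from the mean wall time of the
    iterations completed so far."""
    per_iter = elapsed_seconds / max(iterations_done, 1)
    remaining = per_iter * (total_iterations - iterations_done)
    return {
        "per_iteration_s": per_iter,
        "remaining_s": remaining,
        "total_s": elapsed_seconds + remaining,
        "next_10_s": per_iter * 10,
        "next_100_s": per_iter * 100,
        "next_500_s": per_iter * 500,
    }
```

At ~18 s per iteration this matches the log's rough figures (10 more iterations ≈ 3 m, 500 more ≈ 2.5 h); the small drift between consecutive estimates in the log suggests the real implementation re-averages as iterations complete.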
[2025-11-13 08:13:00,542][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:13:04,546][__main__][INFO] - Number of regex retries in iteration 22: 0
[2025-11-13 08:13:04,547][__main__][INFO] - agents played in iteration 22 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:13:04,984][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:13:05,025][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:13:05,066][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:13:05,106][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:13:05,106][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:13:05,107][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:13:05,820][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:13:06,117][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:13:06,443][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:13:06,770][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:13:07,094][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:13:07,426][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:13:07,751][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:13:08,073][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:13:08,399][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:13:08,724][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:13:09,056][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:13:09,379][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:13:09,703][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:13:10,031][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:13:10,364][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:13:10,689][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:13:11,014][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:13:11,341][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:13:11,665][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:13:11,988][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:13:12,313][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:13:12,639][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:13:12,964][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:13:13,288][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:13:13,612][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:13:13,937][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:13:14,263][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:13:14,592][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:13:14,917][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:13:15,241][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:13:15,567][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:13:15,894][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:13:16,219][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:13:16,911][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:13:17,646][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:13:17,647][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:13:17,649][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:13:18,644][__main__][INFO] - Iteration 23 took 18s (22.12% Gen, 72.38% Train). Generation: 4s, Training: 13s. Estimated remaining time: 14h 55m 40s. Estimated total time: 15h 5m 9s. Time estimates for 10 more iterations: 3m 1s, 100 more iterations: 30m 10s, 500 more iterations: 2h 30m 51s.
[2025-11-13 08:13:18,646][__main__][INFO] - Starting iteration 23.
[2025-11-13 08:13:18,649][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1.
[2025-11-13 08:13:18,650][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:13:22,696][__main__][INFO] - Number of regex retries in iteration 23: 0
[2025-11-13 08:13:22,697][__main__][INFO] - agents played in iteration 23 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:13:23,134][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:13:23,175][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:13:23,216][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:13:23,256][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:13:23,257][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:13:23,257][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:13:23,983][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:13:24,280][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:13:24,608][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:13:24,935][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:13:25,262][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:13:25,589][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:13:25,914][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:13:26,240][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:13:26,566][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:13:26,891][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:13:27,216][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:13:27,541][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:13:27,865][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:13:28,188][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:13:28,513][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:13:28,838][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:13:29,164][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:13:29,491][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:13:29,816][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:13:30,139][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:13:30,464][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:13:30,787][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:13:31,113][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:13:31,438][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:13:31,763][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:13:32,086][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:13:32,411][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:13:32,735][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:13:33,060][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:13:33,384][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:13:33,710][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:13:34,036][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:13:34,369][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:13:35,100][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:13:35,842][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:13:35,844][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:13:35,846][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:13:36,854][__main__][INFO] - Iteration 24 took 18s (22.23% Gen, 72.23% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 0m 28s. Estimated total time: 15h 10m 16s. Time estimates for 10 more iterations: 3m 2s, 100 more iterations: 30m 20s, 500 more iterations: 2h 31m 42s.
[2025-11-13 08:13:36,856][__main__][INFO] - Starting iteration 24.
[2025-11-13 08:13:36,859][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1.
[2025-11-13 08:13:36,860][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:13:40,981][__main__][INFO] - Number of regex retries in iteration 24: 0
[2025-11-13 08:13:40,982][__main__][INFO] - agents played in iteration 24 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:13:41,416][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:13:41,457][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:13:41,497][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:13:41,537][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:13:41,538][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:13:41,538][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:13:42,268][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:13:42,565][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:13:42,889][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:13:43,217][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:13:43,548][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:13:43,874][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:13:44,209][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:13:44,536][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:13:44,864][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:13:45,190][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:13:45,516][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:13:45,841][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:13:46,164][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:13:46,488][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:13:46,813][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:13:47,136][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:13:47,462][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:13:47,789][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:13:48,113][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:13:48,437][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:13:48,761][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:13:49,086][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:13:49,410][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:13:49,735][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:13:50,061][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:13:50,384][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:13:50,707][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:13:51,030][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:13:51,354][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:13:51,679][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:13:52,004][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:13:52,329][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:13:52,653][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:13:53,367][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:13:54,116][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:13:54,117][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:13:54,119][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:13:55,304][__main__][INFO] - Iteration 25 took 18s (22.34% Gen, 71.22% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 12m 11s. Estimated total time: 15h 22m 17s. Time estimates for 10 more iterations: 3m 4s, 100 more iterations: 30m 44s, 500 more iterations: 2h 33m 42s.
[2025-11-13 08:13:55,306][__main__][INFO] - Starting iteration 25.
[2025-11-13 08:13:55,309][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1.
[2025-11-13 08:13:55,310][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:13:59,334][__main__][INFO] - Number of regex retries in iteration 25: 0
[2025-11-13 08:13:59,335][__main__][INFO] - agents played in iteration 25 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:13:59,772][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:13:59,812][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:13:59,853][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:13:59,893][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:13:59,894][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:13:59,894][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:14:00,608][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:14:00,906][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:14:01,231][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:14:01,554][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:14:01,880][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:14:02,206][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:14:02,531][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:14:02,855][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:14:03,181][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:14:03,505][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:14:03,829][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:14:04,159][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:14:04,482][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:14:04,807][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:14:05,131][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:14:05,458][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:14:05,784][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:14:06,116][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:14:06,442][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:14:06,767][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:14:07,090][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:14:07,418][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:14:07,742][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:14:08,066][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:14:08,390][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:14:08,716][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:14:09,043][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:14:09,369][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:14:09,693][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:14:10,019][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:14:10,343][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:14:10,666][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:14:10,990][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:14:11,723][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:14:12,459][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:14:12,461][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:14:12,462][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:14:13,504][__main__][INFO] - Iteration 26 took 18s (22.12% Gen, 72.15% Train). Generation: 4s, Training: 13s. Estimated remaining time: 14h 59m 22s. Estimated total time: 15h 9m 46s. Time estimates for 10 more iterations: 3m 1s, 100 more iterations: 30m 19s, 500 more iterations: 2h 31m 37s.
[2025-11-13 08:14:13,507][__main__][INFO] - Starting iteration 26.
[2025-11-13 08:14:13,510][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1.
[2025-11-13 08:14:13,510][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:14:17,482][__main__][INFO] - Number of regex retries in iteration 26: 0
[2025-11-13 08:14:17,483][__main__][INFO] - agents played in iteration 26 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:14:17,925][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:14:17,969][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:14:18,010][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:14:18,051][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:14:18,052][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:14:18,053][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:14:18,772][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:14:19,069][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:14:19,394][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:14:19,722][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:14:20,047][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:14:20,373][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:14:20,698][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:14:21,021][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:14:21,347][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:14:21,670][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:14:21,995][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:14:22,319][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:14:22,642][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:14:22,967][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:14:23,290][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:14:23,616][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:14:23,940][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:14:24,264][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:14:24,591][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:14:24,916][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:14:25,240][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:14:25,567][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:14:25,894][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:14:26,222][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:14:26,549][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:14:26,877][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:14:27,202][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:14:27,527][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:14:27,852][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:14:28,178][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:14:28,503][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:14:28,827][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:14:29,152][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:14:29,858][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:14:30,593][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:14:30,595][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:14:30,597][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:14:31,641][__main__][INFO] - Iteration 27 took 18s (21.91% Gen, 72.33% Train). Generation: 3s, Training: 13s. Estimated remaining time: 14h 55m 55s. Estimated total time: 15h 6m 37s. Time estimates for 10 more iterations: 3m 1s, 100 more iterations: 30m 13s, 500 more iterations: 2h 31m 6s.
[2025-11-13 08:14:31,644][__main__][INFO] - Starting iteration 27.
[2025-11-13 08:14:31,648][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1.
[2025-11-13 08:14:31,648][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:14:35,813][__main__][INFO] - Number of regex retries in iteration 27: 0 [2025-11-13 08:14:35,813][__main__][INFO] - agents played in iteration 27 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 08:14:36,255][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:14:36,296][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:14:36,337][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:14:36,377][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:14:36,378][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:14:36,378][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
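The repeated "For task: …, ΔVRAM % (total): …, ΔTime: …" records suggest a task-scoped resource logger wrapped around each trainer block. A minimal sketch of such a helper, assuming a context-manager design; the names `format_task_stats` and `TaskResourceLogger` are hypothetical, and in the real trainer the VRAM percentages would presumably come from the CUDA allocator (e.g. `torch.cuda.memory_allocated` and `torch.cuda.max_memory_allocated`), which this sketch stubs out:

```python
import time
import logging

logging.basicConfig(level=logging.INFO, format="[%(name)s][%(levelname)s] - %(message)s")
log = logging.getLogger("mllm.training.trainer_ad_align")


def format_task_stats(task, dvram_pct, current_pct, peak_pct, seconds):
    # Render one "For task: ..." record in the same shape as the log above.
    hms = time.strftime("%H:%M:%S", time.gmtime(seconds))
    return (
        f"For task: {task}, ΔVRAM % (total): {dvram_pct:.2f}%, "
        f"Current % of VRAM taken: {current_pct:.2f}%, "
        f"Block Peak % of device VRAM: {peak_pct:.2f}%, ΔTime: {hms}"
    )


class TaskResourceLogger:
    # Hypothetical context manager: times a named block and emits one record
    # on exit. Real VRAM figures would be read from CUDA memory stats; zeros
    # are placeholders here so the sketch runs without a GPU.
    def __init__(self, task):
        self.task = task

    def __enter__(self):
        self.start = time.monotonic()
        return self

    def __exit__(self, *exc):
        elapsed = time.monotonic() - self.start
        log.info(format_task_stats(self.task, 0.0, 0.0, 0.0, elapsed))
        return False
```

Wrapping a block in `with TaskResourceLogger("Apply reinforce step"):` would then produce one such record per task, matching the cadence seen in the log.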
[2025-11-13 08:14:37,102][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:14:37,399][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:14:37,725][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:14:38,051][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:14:38,375][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:14:38,700][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:14:39,025][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:14:39,354][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:14:39,680][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:14:40,004][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:14:40,329][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:14:40,655][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:14:40,979][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:14:41,304][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:14:41,631][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:14:41,955][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:14:42,281][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:14:42,608][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:14:42,933][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:14:43,258][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:14:43,585][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:14:43,909][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:14:44,235][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:14:44,561][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:14:44,885][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:14:45,209][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:14:45,533][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:14:45,860][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:14:46,186][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:14:46,513][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:14:46,838][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:14:47,163][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:14:47,487][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
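The "Processing mini-batch k of 128" cadence followed by a single "Apply reinforce step" is the standard gradient-accumulation pattern: one backward pass per mini-batch, then one optimizer step over the summed gradients. A schematic sketch of that loop; the function name and the `backward` callback are hypothetical stand-ins (the callback represents `loss.backward()` on one mini-batch and returns its contributing token count), not the actual mllm API:

```python
import logging

logging.basicConfig(level=logging.INFO, format="[%(name)s][%(levelname)s] - %(message)s")
log = logging.getLogger("mllm.training.trainer_common")


def accumulate_policy_gradient(mini_batches, backward, log_every=4):
    # Run a backward pass per mini-batch so gradients sum in place across
    # the whole batch before a single optimizer step is applied.
    total_tokens = 0
    n = len(mini_batches)
    for i, batch in enumerate(mini_batches):
        if i % log_every == 0:
            log.info("Processing mini-batch %d of %d", i, n)
        total_tokens += backward(batch)
    log.info("Accumulated the policy gradient loss for %d tokens.", total_tokens)
    return total_tokens
```

With 128 mini-batches averaging 30 trainable tokens each, this accounts for the 3840-token totals reported above; the optimizer step ("Apply reinforce step") would follow the loop.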
[2025-11-13 08:14:48,197][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:14:48,950][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:14:48,951][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:14:48,953][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:14:49,948][__main__][INFO] - Iteration 28 took 18s (22.76% Gen, 71.79% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 4m 3s. Estimated total time: 15h 15m 4s. Time estimates for 10 more iterations: 3m 3s, 100 more iterations: 30m 30s, 500 more iterations: 2h 32m 30s.
[2025-11-13 08:14:49,950][__main__][INFO] - Starting iteration 28.
[2025-11-13 08:14:49,953][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1.
[2025-11-13 08:14:49,954][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:14:54,060][__main__][INFO] - Number of regex retries in iteration 28: 0
[2025-11-13 08:14:54,061][__main__][INFO] - agents played in iteration 28 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:14:54,501][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:14:54,542][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:14:54,583][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:14:54,623][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:14:54,624][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:14:54,624][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:14:55,348][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:14:55,644][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:14:55,970][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:14:56,292][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:14:56,616][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:14:56,940][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:14:57,265][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:14:57,590][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:14:57,916][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:14:58,242][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:14:58,566][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:14:58,889][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:14:59,215][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:14:59,543][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:14:59,870][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:15:00,196][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:15:00,521][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:15:00,846][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:15:01,170][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:15:01,493][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:15:01,818][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:15:02,145][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:15:02,468][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:15:02,792][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:15:03,116][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:15:03,440][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:15:03,768][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:15:04,094][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:15:04,422][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:15:04,744][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:15:05,071][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:15:05,396][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:15:05,721][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:15:06,422][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:15:07,191][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:15:07,192][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:15:07,194][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:15:08,177][__main__][INFO] - Iteration 29 took 18s (22.53% Gen, 72.07% Train). Generation: 4s, Training: 13s. Estimated remaining time: 14h 59m 54s. Estimated total time: 15h 11m 13s. Time estimates for 10 more iterations: 3m 2s, 100 more iterations: 30m 22s, 500 more iterations: 2h 31m 52s.
[2025-11-13 08:15:08,179][__main__][INFO] - Starting iteration 29.
[2025-11-13 08:15:08,183][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1.
[2025-11-13 08:15:08,183][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:15:12,370][__main__][INFO] - Number of regex retries in iteration 29: 0
[2025-11-13 08:15:12,371][__main__][INFO] - agents played in iteration 29 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:15:12,815][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:15:12,856][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:15:12,896][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:15:12,937][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:15:12,938][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:15:12,938][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:15:13,665][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:15:13,962][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:15:14,292][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:15:14,619][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:15:14,946][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:15:15,274][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:15:15,602][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:15:15,932][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:15:16,256][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:15:16,581][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:15:16,909][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:15:17,235][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:15:17,566][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:15:17,899][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:15:18,223][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:15:18,549][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:15:18,877][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:15:19,210][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:15:19,535][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:15:19,861][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:15:20,188][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:15:20,514][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:15:20,842][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:15:21,167][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:15:21,493][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:15:21,820][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:15:22,147][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:15:22,472][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:15:22,798][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:15:23,124][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:15:23,456][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:15:23,784][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:15:24,109][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:15:24,816][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:15:25,554][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:15:25,555][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:15:25,557][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:15:26,559][__main__][INFO] - Iteration 30 took 18s (22.79% Gen, 71.75% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 7m 13s. Estimated total time: 15h 18m 51s. Time estimates for 10 more iterations: 3m 3s, 100 more iterations: 30m 37s, 500 more iterations: 2h 33m 8s.
[2025-11-13 08:15:26,561][__main__][INFO] - Starting iteration 30.
[2025-11-13 08:15:26,564][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1.
[2025-11-13 08:15:26,565][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:15:30,647][__main__][INFO] - Number of regex retries in iteration 30: 0
[2025-11-13 08:15:30,648][__main__][INFO] - agents played in iteration 30 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:15:31,085][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:15:31,127][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:15:31,168][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:15:31,209][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:15:31,209][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:15:31,210][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:15:31,931][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:15:32,231][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:15:32,558][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:15:32,882][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:15:33,207][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:15:33,532][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:15:33,855][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:15:34,180][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:15:34,506][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:15:34,829][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:15:35,154][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:15:35,479][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:15:35,804][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:15:36,134][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:15:36,461][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:15:36,788][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:15:37,117][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:15:37,444][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:15:37,771][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:15:38,098][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:15:38,422][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:15:38,746][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:15:39,069][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:15:39,393][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:15:39,717][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:15:40,042][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:15:40,366][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:15:40,691][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:15:41,014][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:15:41,340][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:15:41,665][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:15:41,991][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:15:42,314][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:15:43,003][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:15:43,728][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:15:43,729][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:15:43,731][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:15:45,666][__main__][INFO] - Iteration 31 took 19s (21.37% Gen, 68.49% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 43m 11s. Estimated total time: 15h 55m 8s. Time estimates for 10 more iterations: 3m 11s, 100 more iterations: 31m 50s, 500 more iterations: 2h 39m 11s.
[2025-11-13 08:15:45,668][__main__][INFO] - Starting iteration 31.
[2025-11-13 08:15:45,672][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1.
[2025-11-13 08:15:45,673][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:15:50,611][__main__][INFO] - Number of regex retries in iteration 31: 0
[2025-11-13 08:15:50,612][__main__][INFO] - agents played in iteration 31 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:15:51,046][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:15:51,088][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:15:51,128][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:15:51,168][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:15:51,169][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:15:51,169][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:15:51,906][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:15:52,201][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:15:52,528][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:15:52,853][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:15:53,186][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:15:53,512][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:15:53,837][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:15:54,164][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:15:54,490][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:15:54,814][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:15:55,141][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:15:55,468][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:15:55,795][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:15:56,119][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:15:56,448][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:15:56,775][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:15:57,102][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:15:57,433][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:15:57,761][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:15:58,087][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:15:58,415][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:15:58,742][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:15:59,070][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:15:59,399][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:15:59,731][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:16:00,062][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:16:00,387][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:16:00,713][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:16:01,037][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:16:01,362][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:16:01,695][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:16:02,020][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:16:02,343][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:16:03,067][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:16:03,821][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:16:03,822][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:16:03,824][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:16:04,884][__main__][INFO] - Iteration 32 took 19s (25.70% Gen, 68.77% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 48m 23s. Estimated total time: 16h 0m 38s. Time estimates for 10 more iterations: 3m 12s, 100 more iterations: 32m 1s, 500 more iterations: 2h 40m 6s.
[2025-11-13 08:16:04,886][__main__][INFO] - Starting iteration 32.
[2025-11-13 08:16:04,889][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1.
[2025-11-13 08:16:04,890][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:16:09,339][__main__][INFO] - Number of regex retries in iteration 32: 0
[2025-11-13 08:16:09,340][__main__][INFO] - agents played in iteration 32 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:16:09,793][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:16:09,834][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:16:09,875][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:16:09,916][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:16:09,916][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:16:09,917][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:16:10,636][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:16:10,932][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:16:11,257][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:16:11,584][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:16:11,911][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:16:12,242][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:16:12,567][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:16:12,890][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:16:13,214][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:16:13,539][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:16:13,866][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:16:14,192][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:16:14,518][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:16:14,844][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:16:15,171][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:16:15,495][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:16:15,821][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:16:16,146][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:16:16,471][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:16:16,798][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:16:17,125][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:16:17,453][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:16:17,780][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:16:18,108][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:16:18,434][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:16:18,758][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:16:19,082][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:16:19,406][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:16:19,731][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:16:20,054][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:16:20,377][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:16:20,703][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:16:21,028][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:16:21,729][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:16:22,473][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:16:22,475][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:16:22,477][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:16:23,463][__main__][INFO] - Iteration 33 took 18s (23.96% Gen, 70.73% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 16m 10s. Estimated total time: 15h 28m 44s. Time estimates for 10 more iterations: 3m 5s, 100 more iterations: 30m 57s, 500 more iterations: 2h 34m 47s.
[2025-11-13 08:16:23,466][__main__][INFO] - Starting iteration 33.
[2025-11-13 08:16:23,469][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1.
[2025-11-13 08:16:23,469][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:16:26,847][mllm.models.large_language_model_local][WARNING] - Response %A> did not match regex: (|), retry 1/1
[2025-11-13 08:16:28,580][__main__][INFO] - Number of regex retries in iteration 33: 1
[2025-11-13 08:16:28,581][__main__][INFO] - agents played in iteration 33 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:16:29,024][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:16:29,061][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:16:29,095][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:16:29,129][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:16:29,129][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:16:29,130][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:16:29,802][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:16:30,094][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:16:30,416][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:16:30,738][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:16:31,061][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:16:31,383][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:16:31,706][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:16:32,027][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:16:32,355][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:16:32,683][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:16:33,008][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:16:33,328][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:16:33,652][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:16:33,973][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:16:34,296][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:16:34,619][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:16:34,941][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:16:35,263][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:16:35,585][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:16:35,908][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:16:36,231][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:16:36,553][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:16:36,874][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:16:37,195][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:16:37,521][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:16:37,843][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:16:38,165][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:16:38,486][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:16:38,808][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:16:39,134][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:16:39,455][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:16:39,778][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:16:40,101][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:16:40,810][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:16:41,541][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:16:41,542][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:16:41,544][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:16:42,563][__main__][INFO] - Iteration 34 took 19s (26.77% Gen, 67.88% Train). Generation: 5s, Training: 12s. Estimated remaining time: 15h 41m 53s. Estimated total time: 15h 54m 46s. Time estimates for 10 more iterations: 3m 10s, 100 more iterations: 31m 49s, 500 more iterations: 2h 39m 7s.
[2025-11-13 08:16:42,566][__main__][INFO] - Starting iteration 34.
[2025-11-13 08:16:42,568][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1.
[2025-11-13 08:16:42,568][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:16:47,133][__main__][INFO] - Number of regex retries in iteration 34: 0
[2025-11-13 08:16:47,133][__main__][INFO] - agents played in iteration 34 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:16:47,553][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:16:47,587][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:16:47,621][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:16:47,655][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:16:47,656][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:16:47,656][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:16:48,352][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:16:48,644][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:16:48,966][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:16:49,289][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:16:49,611][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:16:49,935][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:16:50,261][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:16:50,589][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:16:50,913][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:16:51,237][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:16:51,561][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:16:51,888][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:16:52,213][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:16:52,536][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:16:52,860][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:16:53,184][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:16:53,508][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:16:53,831][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:16:54,155][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:16:54,476][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:16:54,802][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:16:55,128][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:16:55,451][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:16:55,779][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:16:56,104][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:16:56,427][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:16:56,751][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:16:57,076][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:16:57,401][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:16:57,723][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:16:58,051][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:16:58,375][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:16:58,697][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:16:59,398][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:17:00,131][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:17:00,132][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:17:00,134][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:17:01,117][__main__][INFO] - Iteration 35 took 18s (24.61% Gen, 70.09% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 14m 17s. Estimated total time: 15h 27m 29s. Time estimates for 10 more iterations: 3m 5s, 100 more iterations: 30m 54s, 500 more iterations: 2h 34m 34s.
[2025-11-13 08:17:01,119][__main__][INFO] - Starting iteration 35.
[2025-11-13 08:17:01,123][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1.
[2025-11-13 08:17:01,123][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:17:05,825][__main__][INFO] - Number of regex retries in iteration 35: 0
[2025-11-13 08:17:05,825][__main__][INFO] - agents played in iteration 35 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:17:06,259][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:17:06,293][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:17:06,327][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:17:06,361][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:17:06,362][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:17:06,362][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:17:07,056][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:17:07,350][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:17:07,672][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:17:07,992][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:17:08,316][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:17:08,637][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:17:08,960][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:17:09,283][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:17:09,605][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:17:09,927][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:17:10,249][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:17:10,572][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:17:10,894][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:17:11,217][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:17:11,542][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:17:11,865][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:17:12,189][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:17:12,513][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:17:12,836][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:17:13,159][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:17:13,480][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:17:13,804][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:17:14,125][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:17:14,449][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:17:14,775][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:17:15,103][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:17:15,428][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:17:15,756][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:17:16,080][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:17:16,403][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:17:16,726][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:17:17,047][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:17:17,368][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:17:18,095][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:17:18,822][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:17:18,823][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:17:18,825][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:17:19,802][__main__][INFO] - Iteration 36 took 18s (25.17% Gen, 69.59% Train). Generation: 4s, Training: 12s. Estimated remaining time: 15h 20m 30s. Estimated total time: 15h 34m 0s. Time estimates for 10 more iterations: 3m 6s, 100 more iterations: 31m 8s, 500 more iterations: 2h 35m 40s.
[2025-11-13 08:17:19,804][__main__][INFO] - Starting iteration 36.
[2025-11-13 08:17:19,807][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1.
[2025-11-13 08:17:19,808][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:17:24,419][__main__][INFO] - Number of regex retries in iteration 36: 0
[2025-11-13 08:17:24,420][__main__][INFO] - agents played in iteration 36 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:17:24,852][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:17:24,889][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:17:24,924][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:17:24,958][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:17:24,959][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:17:24,959][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:17:25,665][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:17:25,959][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:17:26,283][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:17:26,605][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:17:26,928][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:17:27,255][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:17:27,582][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:17:27,907][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:17:28,229][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:17:28,551][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:17:28,874][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:17:29,196][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:17:29,519][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:17:29,844][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:17:30,167][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:17:30,492][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:17:30,814][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:17:31,138][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:17:31,459][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:17:31,782][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:17:32,105][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:17:32,429][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:17:32,751][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:17:33,073][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:17:33,394][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:17:33,716][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:17:34,039][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:17:34,367][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:17:34,691][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:17:35,013][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:17:35,339][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:17:35,663][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:17:35,988][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:17:36,709][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:17:37,431][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:17:37,433][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:17:37,434][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:17:38,535][__main__][INFO] - Iteration 37 took 18s (24.62% Gen, 69.49% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 22m 36s. Estimated total time: 15h 36m 25s. Time estimates for 10 more iterations: 3m 7s, 100 more iterations: 31m 12s, 500 more iterations: 2h 36m 4s.
[2025-11-13 08:17:38,538][__main__][INFO] - Starting iteration 37.
[2025-11-13 08:17:38,541][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1.
[2025-11-13 08:17:38,542][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:17:43,081][__main__][INFO] - Number of regex retries in iteration 37: 0
[2025-11-13 08:17:43,082][__main__][INFO] - agents played in iteration 37 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:17:43,534][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:17:43,568][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:17:43,601][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:17:43,635][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:17:43,635][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:17:43,636][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:17:44,346][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:17:44,642][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:17:44,969][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:17:45,294][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:17:45,619][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:17:45,943][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:17:46,271][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:17:46,600][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:17:46,921][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:17:47,243][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:17:47,564][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:17:47,888][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:17:48,210][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:17:48,533][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:17:48,855][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:17:49,178][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:17:49,500][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:17:49,822][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:17:50,145][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:17:50,467][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:17:50,788][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:17:51,110][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:17:51,433][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:17:51,761][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:17:52,085][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:17:52,408][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:17:52,732][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:17:53,054][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:17:53,375][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:17:53,700][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:17:54,024][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:17:54,347][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:17:54,673][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:17:55,361][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:17:56,081][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:17:56,083][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:17:56,084][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:17:57,157][__main__][INFO] - Iteration 38 took 18s (24.39% Gen, 69.84% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 16m 42s. Estimated total time: 15h 30m 50s. Time estimates for 10 more iterations: 3m 6s, 100 more iterations: 31m 1s, 500 more iterations: 2h 35m 8s.
[2025-11-13 08:17:57,159][__main__][INFO] - Starting iteration 38.
[2025-11-13 08:17:57,163][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1.
[2025-11-13 08:17:57,163][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:18:01,807][__main__][INFO] - Number of regex retries in iteration 38: 0 [2025-11-13 08:18:01,807][__main__][INFO] - agents played in iteration 38 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 08:18:02,261][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:18:02,295][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:18:02,328][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:18:02,361][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:18:02,362][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:18:02,362][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:18:03,071][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:18:03,366][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:18:03,691][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:18:04,015][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:18:04,337][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:18:04,660][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:18:04,982][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:18:05,304][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:18:05,625][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:18:05,950][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:18:06,272][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:18:06,592][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:18:06,917][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:18:07,239][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:18:07,562][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:18:07,885][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:18:08,208][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:18:08,531][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:18:08,853][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:18:09,176][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:18:09,500][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:18:09,824][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:18:10,146][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:18:10,469][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:18:10,794][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:18:11,118][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:18:11,443][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:18:11,766][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:18:12,090][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:18:12,418][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:18:12,748][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:18:13,071][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:18:13,399][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:18:14,106][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:18:14,827][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:18:14,828][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:18:14,830][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:18:15,824][__main__][INFO] - Iteration 39 took 18s (24.88% Gen, 69.78% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 18m 40s. Estimated total time: 15h 33m 7s. Time estimates for 10 more iterations: 3m 6s, 100 more iterations: 31m 6s, 500 more iterations: 2h 35m 31s.
[2025-11-13 08:18:15,826][__main__][INFO] - Starting iteration 39.
[2025-11-13 08:18:15,829][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1.
[2025-11-13 08:18:15,829][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:18:20,406][__main__][INFO] - Number of regex retries in iteration 39: 0
[2025-11-13 08:18:20,407][__main__][INFO] - agents played in iteration 39 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:18:20,862][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:18:20,895][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:18:20,929][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:18:20,962][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:18:20,963][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:18:20,963][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:18:21,677][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:18:21,971][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:18:22,296][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:18:22,621][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:18:22,945][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:18:23,267][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:18:23,590][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:18:23,910][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:18:24,231][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:18:24,553][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:18:24,874][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:18:25,197][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:18:25,520][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:18:25,843][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:18:26,170][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:18:26,493][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:18:26,815][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:18:27,138][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:18:27,460][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:18:27,781][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:18:28,106][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:18:28,428][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:18:28,749][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:18:29,072][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:18:29,395][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:18:29,716][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:18:30,042][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:18:30,366][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:18:30,688][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:18:31,012][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:18:31,334][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:18:31,658][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:18:31,980][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:18:32,675][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:18:33,397][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:18:33,398][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:18:33,400][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:18:34,443][__main__][INFO] - Iteration 40 took 18s (24.59% Gen, 69.80% Train). Generation: 4s, Training: 12s. Estimated remaining time: 15h 15m 59s. Estimated total time: 15h 30m 44s. Time estimates for 10 more iterations: 3m 6s, 100 more iterations: 31m 1s, 500 more iterations: 2h 35m 7s.
[2025-11-13 08:18:34,445][__main__][INFO] - Starting iteration 40.
[2025-11-13 08:18:34,449][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1.
[2025-11-13 08:18:34,450][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:18:39,088][__main__][INFO] - Number of regex retries in iteration 40: 0
[2025-11-13 08:18:39,089][__main__][INFO] - agents played in iteration 40 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:18:39,535][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:18:39,568][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:18:39,602][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:18:39,635][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:18:39,636][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:18:39,637][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:18:40,346][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:18:40,641][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:18:40,964][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:18:41,286][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:18:41,612][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:18:41,940][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:18:42,260][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:18:42,586][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:18:42,914][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:18:43,238][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:18:43,563][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:18:43,886][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:18:44,216][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:18:44,539][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:18:44,864][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:18:45,189][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:18:45,517][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:18:45,843][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:18:46,166][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:18:46,489][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:18:46,814][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:18:47,138][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:18:47,460][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:18:47,785][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:18:48,112][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:18:48,436][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:18:48,759][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:18:49,084][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:18:49,411][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:18:49,735][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:18:50,063][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:18:50,387][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:18:50,711][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:18:51,412][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:18:52,143][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:18:52,144][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:18:52,146][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:18:54,096][__main__][INFO] - Iteration 41 took 19s (23.61% Gen, 66.46% Train). Generation: 4s, Training: 13s. Estimated remaining time: 16h 7m 17s. Estimated total time: 16h 22m 22s. Time estimates for 10 more iterations: 3m 16s, 100 more iterations: 32m 44s, 500 more iterations: 2h 43m 43s.
[2025-11-13 08:18:54,098][__main__][INFO] - Starting iteration 41.
[2025-11-13 08:18:54,101][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1.
[2025-11-13 08:18:54,102][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:18:59,330][__main__][INFO] - Number of regex retries in iteration 41: 0
[2025-11-13 08:18:59,330][__main__][INFO] - agents played in iteration 41 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:18:59,775][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:18:59,808][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:18:59,841][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:18:59,875][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:18:59,875][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:18:59,876][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:19:00,595][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:19:00,889][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:19:01,212][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:19:01,533][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:19:01,854][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:19:02,177][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:19:02,499][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:19:02,825][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:19:03,147][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:19:03,469][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:19:03,794][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:19:04,115][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:19:04,438][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:19:04,759][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:19:05,082][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:19:05,405][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:19:05,726][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:19:06,048][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:19:06,375][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:19:06,703][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:19:07,027][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:19:07,356][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:19:07,677][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:19:08,002][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:19:08,325][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:19:08,648][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:19:08,972][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:19:09,295][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:19:09,618][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:19:09,944][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:19:10,271][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:19:10,596][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:19:10,925][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:19:11,629][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:19:12,355][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:19:12,356][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:19:12,358][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:19:13,481][__main__][INFO] - Iteration 42 took 19s (26.98% Gen, 67.22% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 53m 37s. Estimated total time: 16h 9m 2s. Time estimates for 10 more iterations: 3m 13s, 100 more iterations: 32m 18s, 500 more iterations: 2h 41m 30s.
[2025-11-13 08:19:13,484][__main__][INFO] - Starting iteration 42.
[2025-11-13 08:19:13,487][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1.
[2025-11-13 08:19:13,487][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:19:18,396][__main__][INFO] - Number of regex retries in iteration 42: 0
[2025-11-13 08:19:18,396][__main__][INFO] - agents played in iteration 42 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:19:18,834][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:19:18,868][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:19:18,901][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:19:18,935][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:19:18,936][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:19:18,936][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:19:19,645][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:19:19,938][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:19:20,264][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:19:20,588][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:19:20,911][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:19:21,233][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:19:21,557][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:19:21,879][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:19:22,202][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:19:22,524][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:19:22,846][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:19:23,168][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:19:23,491][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:19:23,814][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:19:24,136][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:19:24,458][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:19:24,779][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:19:25,101][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:19:25,424][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:19:25,746][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:19:26,068][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:19:26,390][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:19:26,713][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:19:27,034][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:19:27,357][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:19:27,680][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:19:28,001][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:19:28,322][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:19:28,644][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:19:28,967][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:19:29,291][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:19:29,614][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:19:29,936][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:19:30,640][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:19:31,371][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:19:31,372][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:19:31,374][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:19:32,518][__main__][INFO] - Iteration 43 took 19s (25.79% Gen, 68.19% Train). Generation: 4s, Training: 12s. Estimated remaining time: 15h 35m 54s. Estimated total time: 15h 51m 37s. Time estimates for 10 more iterations: 3m 10s, 100 more iterations: 31m 43s, 500 more iterations: 2h 38m 36s.
[2025-11-13 08:19:32,521][__main__][INFO] - Starting iteration 43.
[2025-11-13 08:19:32,524][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1.
[2025-11-13 08:19:32,524][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:19:37,653][__main__][INFO] - Number of regex retries in iteration 43: 0
[2025-11-13 08:19:37,653][__main__][INFO] - agents played in iteration 43 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:19:38,097][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:19:38,136][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:19:38,171][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:19:38,205][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:19:38,206][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:19:38,206][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:19:38,915][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:19:39,210][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:19:39,534][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:19:39,854][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:19:40,178][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:19:40,501][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:19:40,828][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:19:41,153][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:19:41,476][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:19:41,799][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:19:42,126][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:19:42,450][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:19:42,775][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:19:43,101][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:19:43,428][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:19:43,752][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:19:44,075][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:19:44,398][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:19:44,723][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:19:45,047][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:19:45,372][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:19:45,697][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:19:46,023][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:19:46,350][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:19:46,674][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:19:47,000][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:19:47,324][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:19:47,649][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:19:47,973][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:19:48,295][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:19:48,623][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:19:48,948][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:19:49,271][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:19:49,979][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:19:50,700][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:19:50,701][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:19:50,703][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:19:51,766][__main__][INFO] - Iteration 44 took 19s (26.65% Gen, 67.81% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 46m 7s. Estimated total time: 16h 2m 9s. Time estimates for 10 more iterations: 3m 12s, 100 more iterations: 32m 4s, 500 more iterations: 2h 40m 21s.
[2025-11-13 08:19:51,768][__main__][INFO] - Starting iteration 44.
[2025-11-13 08:19:51,771][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1.
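The per-iteration timing summaries above follow a fixed format. As a minimal sketch (not part of the trainer; the regex and field names here are this sketch's own), they can be extracted from the raw log with the standard library:

```python
import re

# Matches the "Iteration N took Xs (G% Gen, T% Train)" prefix of the
# timing lines emitted by __main__ in the log above.
ITER_RE = re.compile(
    r"Iteration (\d+) took (\d+)s \(([\d.]+)% Gen, ([\d.]+)% Train\)"
)

def parse_iteration_lines(text):
    """Return one dict per timing record found in the log text."""
    rows = []
    for m in ITER_RE.finditer(text):
        rows.append({
            "iteration": int(m.group(1)),
            "seconds": int(m.group(2)),
            "gen_pct": float(m.group(3)),
            "train_pct": float(m.group(4)),
        })
    return rows
```

For example, feeding it the line for iteration 44 yields one record with `seconds=19` and `gen_pct=26.65`.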
[2025-11-13 08:19:51,772][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:19:56,799][__main__][INFO] - Number of regex retries in iteration 44: 0
[2025-11-13 08:19:56,799][__main__][INFO] - agents played in iteration 44 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:19:57,244][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:19:57,280][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:19:57,315][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:19:57,349][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:19:57,349][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:19:57,349][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:19:58,074][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:19:58,369][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:19:58,693][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:19:59,014][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:19:59,338][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:19:59,660][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:19:59,984][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:20:00,305][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:20:00,629][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:20:00,951][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:20:01,273][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:20:01,596][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:20:01,918][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:20:02,242][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:20:02,563][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:20:02,885][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:20:03,208][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:20:03,530][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:20:03,853][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:20:04,176][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:20:04,501][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:20:04,824][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:20:05,148][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:20:05,469][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:20:05,792][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:20:06,116][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:20:06,438][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:20:06,761][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:20:07,084][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:20:07,406][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:20:07,730][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:20:08,053][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:20:08,375][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:20:09,087][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:20:09,817][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:20:09,819][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:20:09,820][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:20:10,810][__main__][INFO] - Iteration 45 took 19s (26.41% Gen, 68.39% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 35m 36s. Estimated total time: 15h 51m 58s. Time estimates for 10 more iterations: 3m 10s, 100 more iterations: 31m 43s, 500 more iterations: 2h 38m 39s.
[2025-11-13 08:20:10,812][__main__][INFO] - Starting iteration 45.
[2025-11-13 08:20:10,815][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1.
[2025-11-13 08:20:10,815][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:20:15,829][__main__][INFO] - Number of regex retries in iteration 45: 0
[2025-11-13 08:20:15,830][__main__][INFO] - agents played in iteration 45 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:20:16,282][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:20:16,315][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:20:16,348][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:20:16,382][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:20:16,382][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:20:16,383][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:20:17,095][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:20:17,389][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:20:17,711][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:20:18,033][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:20:18,356][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:20:18,680][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:20:19,003][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:20:19,326][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:20:19,647][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:20:19,968][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:20:20,291][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:20:20,613][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:20:20,935][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:20:21,257][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:20:21,579][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:20:21,902][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:20:22,225][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:20:22,546][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:20:22,869][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:20:23,191][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:20:23,513][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:20:23,834][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:20:24,158][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:20:24,481][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:20:24,802][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:20:25,124][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:20:25,445][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:20:25,767][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:20:26,089][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:20:26,411][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:20:26,733][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:20:27,055][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:20:27,377][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:20:28,093][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:20:28,826][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:20:28,828][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:20:28,830][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:20:29,823][__main__][INFO] - Iteration 46 took 19s (26.38% Gen, 68.39% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 33m 44s. Estimated total time: 15h 50m 25s. Time estimates for 10 more iterations: 3m 10s, 100 more iterations: 31m 40s, 500 more iterations: 2h 38m 24s.
[2025-11-13 08:20:29,825][__main__][INFO] - Starting iteration 46.
[2025-11-13 08:20:29,828][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1.
[2025-11-13 08:20:29,829][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:20:34,909][__main__][INFO] - Number of regex retries in iteration 46: 0
[2025-11-13 08:20:34,910][__main__][INFO] - agents played in iteration 46 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:20:35,372][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:20:35,408][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:20:35,442][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:20:35,475][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:20:35,475][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:20:35,475][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:20:36,194][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:20:36,488][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:20:36,810][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:20:37,135][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:20:37,455][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:20:37,778][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:20:38,101][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:20:38,423][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:20:38,746][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:20:39,069][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:20:39,390][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:20:39,713][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:20:40,037][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:20:40,359][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:20:40,681][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:20:41,003][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:20:41,325][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:20:41,650][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:20:41,974][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:20:42,297][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:20:42,619][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:20:42,941][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:20:43,263][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:20:43,584][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:20:43,907][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:20:44,228][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:20:44,550][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:20:44,873][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:20:45,196][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:20:45,520][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:20:45,843][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:20:46,165][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:20:46,487][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:20:47,198][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:20:47,947][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:20:47,949][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:20:47,951][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:20:49,125][__main__][INFO] - Iteration 47 took 19s (26.33% Gen, 67.58% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 47m 51s. Estimated total time: 16h 4m 51s. Time estimates for 10 more iterations: 3m 12s, 100 more iterations: 32m 9s, 500 more iterations: 2h 40m 48s.
[2025-11-13 08:20:49,127][__main__][INFO] - Starting iteration 47.
[2025-11-13 08:20:49,129][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1.
[2025-11-13 08:20:49,130][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:20:54,122][__main__][INFO] - Number of regex retries in iteration 47: 0
[2025-11-13 08:20:54,123][__main__][INFO] - agents played in iteration 47 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:20:54,590][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:20:54,623][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:20:54,657][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:20:54,691][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:20:54,692][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:20:54,692][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:20:55,402][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:20:55,694][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:20:56,020][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:20:56,347][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:20:56,677][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:20:56,999][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:20:57,322][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:20:57,645][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:20:57,973][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:20:58,302][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:20:58,626][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:20:58,949][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:20:59,271][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:20:59,595][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:20:59,923][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:21:00,247][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:21:00,571][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:21:00,892][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:21:01,216][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:21:01,541][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:21:01,866][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:21:02,189][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:21:02,512][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:21:02,836][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:21:03,157][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:21:03,486][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:21:03,810][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:21:04,140][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:21:04,465][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:21:04,789][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:21:05,112][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:21:05,436][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:21:05,760][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:21:06,470][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:21:07,205][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:21:07,207][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:21:07,208][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:21:08,193][__main__][INFO] - Iteration 48 took 19s (26.19% Gen, 68.63% Train). Generation: 4s, Training: 13s. Estimated remaining time: 15h 35m 55s. Estimated total time: 15h 53m 14s. Time estimates for 10 more iterations: 3m 10s, 100 more iterations: 31m 46s, 500 more iterations: 2h 38m 52s.
[2025-11-13 08:21:08,195][__main__][INFO] - Starting iteration 48.
[2025-11-13 08:21:08,198][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1.
[2025-11-13 08:21:08,199][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:21:13,202][__main__][INFO] - Number of regex retries in iteration 48: 0
[2025-11-13 08:21:13,202][__main__][INFO] - agents played in iteration 48 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:21:13,651][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:21:13,686][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:21:13,720][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:21:13,754][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:21:13,755][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:21:13,755][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:21:14,474][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:21:14,767][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:21:15,091][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:21:15,413][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:21:15,741][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:21:16,065][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:21:16,387][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:21:16,711][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:21:17,034][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:21:17,355][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:21:17,677][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:21:18,001][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:21:18,327][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:21:18,649][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:21:18,971][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:21:19,294][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:21:19,616][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:21:19,940][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:21:20,264][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:21:20,587][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:21:20,911][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:21:21,233][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:21:21,562][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:21:21,885][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:21:22,208][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:21:22,530][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:21:22,852][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:21:23,174][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:21:23,496][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:21:23,818][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:21:24,141][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:21:24,463][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:21:24,787][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:21:25,506][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:21:26,261][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:21:26,262][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:21:26,264][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:21:27,317][__main__][INFO] - Iteration 49 took 19s (26.17% Gen, 68.31% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 38m 21s. Estimated total time: 15h 55m 59s. Time estimates for 10 more iterations: 3m 11s, 100 more iterations: 31m 51s, 500 more iterations: 2h 39m 19s.
[2025-11-13 08:21:27,320][__main__][INFO] - Starting iteration 49.
[2025-11-13 08:21:27,322][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1.
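The "Estimated remaining time" and "N more iterations" figures in the timing lines are consistent with projecting a running mean of per-iteration wall-clock durations. A minimal sketch of that arithmetic (an assumption about how the numbers are produced, not the actual `__main__` implementation):

```python
from datetime import timedelta

def eta(iteration_durations, iterations_left):
    """Project remaining wall-clock time from the mean iteration duration.

    iteration_durations: per-iteration durations in seconds observed so far.
    iterations_left: number of iterations still to run.
    """
    mean = sum(iteration_durations) / len(iteration_durations)
    return timedelta(seconds=round(mean * iterations_left))

# At roughly 19 s/iteration, 500 more iterations project to about 2h 38m,
# the same scale as the "500 more iterations: 2h 38m 39s" estimates above.
```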
[2025-11-13 08:21:27,323][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:21:32,413][__main__][INFO] - Number of regex retries in iteration 49: 0 [2025-11-13 08:21:32,413][__main__][INFO] - agents played in iteration 49 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 08:21:32,853][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:21:32,889][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:21:32,923][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:21:32,957][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:21:32,958][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:21:32,958][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:21:33,686][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:21:33,980][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:21:34,304][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:21:34,627][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:21:34,950][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:21:35,272][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:21:35,602][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:21:35,924][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:21:36,247][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:21:36,571][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:21:36,897][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:21:37,219][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:21:37,543][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:21:37,867][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:21:38,191][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:21:38,515][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:21:38,836][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:21:39,160][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:21:39,483][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:21:39,807][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:21:40,131][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:21:40,453][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:21:40,775][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:21:41,098][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:21:41,422][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:21:41,746][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:21:42,071][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:21:42,392][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:21:42,715][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:21:43,039][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:21:43,362][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:21:43,684][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:21:44,008][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:21:44,720][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:21:45,447][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:21:45,448][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:21:45,450][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:21:46,501][__main__][INFO] - Iteration 50 took 19s (26.54% Gen, 67.97% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 40m 59s. Estimated total time: 15h 58m 57s. Time estimates for 10 more iterations: 3m 11s, 100 more iterations: 31m 57s, 500 more iterations: 2h 39m 49s.
[2025-11-13 08:21:46,502][__main__][INFO] - Starting iteration 50.
[2025-11-13 08:21:46,505][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1.
[2025-11-13 08:21:46,506][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:21:51,550][__main__][INFO] - Number of regex retries in iteration 50: 0
[2025-11-13 08:21:51,551][__main__][INFO] - agents played in iteration 50 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:21:52,000][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:21:52,033][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:21:52,067][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:21:52,103][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:21:52,103][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:21:52,104][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:21:52,828][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:21:53,121][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:21:53,444][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:21:53,766][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:21:54,091][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:21:54,415][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:21:54,739][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:21:55,060][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:21:55,382][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:21:55,706][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:21:56,029][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:21:56,351][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:21:56,674][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:21:56,996][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:21:57,320][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:21:57,642][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:21:57,966][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:21:58,287][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:21:58,611][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:21:58,934][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:21:59,257][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:21:59,579][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:21:59,903][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:22:00,225][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:22:00,547][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:22:00,870][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:22:01,192][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:22:01,514][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:22:01,839][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:22:02,163][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:22:02,486][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:22:02,812][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:22:03,134][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:22:03,842][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:22:04,588][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:22:04,590][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:22:04,591][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:22:06,518][__main__][INFO] - Iteration 51 took 20s (25.21% Gen, 65.16% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 22m 24s. Estimated total time: 16h 40m 41s. Time estimates for 10 more iterations: 3m 20s, 100 more iterations: 33m 21s, 500 more iterations: 2h 46m 46s.
[2025-11-13 08:22:06,521][__main__][INFO] - Starting iteration 51.
[2025-11-13 08:22:06,524][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1.
[2025-11-13 08:22:06,524][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:22:12,037][__main__][INFO] - Number of regex retries in iteration 51: 0
[2025-11-13 08:22:12,038][__main__][INFO] - agents played in iteration 51 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:22:12,474][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:22:12,510][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:22:12,543][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:22:12,577][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:22:12,578][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:22:12,578][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:22:13,288][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:22:13,581][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:22:13,905][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:22:14,229][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:22:14,551][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:22:14,875][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:22:15,197][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:22:15,520][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:22:15,845][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:22:16,172][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:22:16,494][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:22:16,817][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:22:17,140][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:22:17,462][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:22:17,785][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:22:18,108][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:22:18,431][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:22:18,753][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:22:19,082][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:22:19,404][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:22:19,733][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:22:20,055][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:22:20,377][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:22:20,701][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:22:21,026][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:22:21,350][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:22:21,679][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:22:22,002][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:22:22,325][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:22:22,647][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:22:22,969][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:22:23,291][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:22:23,614][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:22:24,323][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:22:25,072][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:22:25,073][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:22:25,075][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:22:26,125][__main__][INFO] - Iteration 52 took 19s (28.13% Gen, 66.51% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 1m 27s. Estimated total time: 16h 20m 4s. Time estimates for 10 more iterations: 3m 16s, 100 more iterations: 32m 40s, 500 more iterations: 2h 43m 20s.
[2025-11-13 08:22:26,127][__main__][INFO] - Starting iteration 52.
[2025-11-13 08:22:26,130][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1.
[2025-11-13 08:22:26,130][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:22:31,444][__main__][INFO] - Number of regex retries in iteration 52: 0
[2025-11-13 08:22:31,445][__main__][INFO] - agents played in iteration 52 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:22:31,889][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:22:31,925][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:22:31,958][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:22:31,991][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:22:31,991][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:22:31,992][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:22:32,710][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:22:33,004][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:22:33,328][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:22:33,653][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:22:33,974][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:22:34,297][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:22:34,620][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:22:34,941][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:22:35,265][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:22:35,587][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:22:35,911][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:22:36,235][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:22:36,561][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:22:36,890][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:22:37,215][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:22:37,537][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:22:37,861][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:22:38,189][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:22:38,515][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:22:38,841][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:22:39,164][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:22:39,486][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:22:39,808][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:22:40,130][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:22:40,453][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:22:40,774][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:22:41,097][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:22:41,420][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:22:41,744][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:22:42,067][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:22:42,390][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:22:42,714][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:22:43,040][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:22:43,747][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:22:44,483][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:22:44,485][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:22:44,486][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:22:45,574][__main__][INFO] - Iteration 53 took 19s (27.33% Gen, 67.07% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 53m 19s. Estimated total time: 16h 12m 16s. Time estimates for 10 more iterations: 3m 14s, 100 more iterations: 32m 24s, 500 more iterations: 2h 42m 2s.
[2025-11-13 08:22:45,577][__main__][INFO] - Starting iteration 53.
[2025-11-13 08:22:45,580][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1.
[2025-11-13 08:22:45,581][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:22:50,928][__main__][INFO] - Number of regex retries in iteration 53: 0
[2025-11-13 08:22:50,929][__main__][INFO] - agents played in iteration 53 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:22:51,369][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:22:51,406][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:22:51,438][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:22:51,471][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:22:51,471][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:22:51,472][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:22:52,191][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:22:52,484][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:22:52,810][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:22:53,134][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:22:53,458][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:22:53,783][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:22:54,108][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:22:54,430][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:22:54,753][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:22:55,075][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:22:55,400][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:22:55,724][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:22:56,052][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:22:56,378][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:22:56,704][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:22:57,027][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:22:57,351][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:22:57,674][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:22:57,996][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:22:58,320][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:22:58,643][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:22:58,966][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:22:59,291][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:22:59,616][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:22:59,939][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:23:00,261][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:23:00,585][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:23:00,908][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:23:01,232][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:23:01,554][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:23:01,877][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:23:02,202][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:23:02,525][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:23:03,234][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:23:03,969][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:23:03,971][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:23:03,972][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:23:05,032][__main__][INFO] - Iteration 54 took 19s (27.49% Gen, 67.05% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 53m 22s. Estimated total time: 16h 12m 38s. Time estimates for 10 more iterations: 3m 14s, 100 more iterations: 32m 25s, 500 more iterations: 2h 42m 6s.
[2025-11-13 08:23:05,034][__main__][INFO] - Starting iteration 54.
[2025-11-13 08:23:05,038][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1.
[2025-11-13 08:23:05,038][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:23:10,458][__main__][INFO] - Number of regex retries in iteration 54: 0
[2025-11-13 08:23:10,459][__main__][INFO] - agents played in iteration 54 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:23:10,902][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:23:10,937][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:23:10,970][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:23:11,002][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:23:11,003][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:23:11,003][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:23:11,719][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:23:12,012][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:23:12,335][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:23:12,657][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:23:12,980][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:23:13,303][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:23:13,626][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:23:13,948][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:23:14,271][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:23:14,597][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:23:14,922][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:23:15,248][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:23:15,573][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:23:15,899][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:23:16,222][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:23:16,544][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:23:16,867][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:23:17,191][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:23:17,513][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:23:17,838][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:23:18,162][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:23:18,484][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:23:18,808][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:23:19,130][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:23:19,456][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:23:19,780][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:23:20,103][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:23:20,426][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:23:20,749][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:23:21,074][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:23:21,398][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:23:21,722][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:23:22,045][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:23:22,768][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:23:23,507][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:23:23,509][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:23:23,510][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:23:24,573][__main__][INFO] - Iteration 55 took 19s (27.75% Gen, 66.81% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 57m 14s. Estimated total time: 16h 16m 49s. Time estimates for 10 more iterations: 3m 15s, 100 more iterations: 32m 33s, 500 more iterations: 2h 42m 48s.
[2025-11-13 08:23:24,576][__main__][INFO] - Starting iteration 55.
[2025-11-13 08:23:24,579][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1.
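The iteration-summary records above appear to extrapolate their time estimates linearly from the average wall-clock time per iteration. A minimal stdlib sketch of that arithmetic (the `format_duration` helper and the ~19.54 s/iteration running mean are assumptions for illustration, not the project's actual code; that mean reproduces the figures logged for iteration 55):

```python
def format_duration(total_seconds: float) -> str:
    """Render seconds as 'Xh Ym Zs', omitting leading zero units (e.g. 195 -> '3m 15s')."""
    h, rem = divmod(int(total_seconds), 3600)
    m, s = divmod(rem, 60)
    parts = []
    if h:
        parts.append(f"{h}h")
    if h or m:
        parts.append(f"{m}m")
    parts.append(f"{s}s")
    return " ".join(parts)

# Hypothetical running mean of per-iteration wall-clock time, in seconds.
avg_iter_seconds = 19.536

# Linear extrapolation, as in the "Time estimates for N more iterations" lines.
for n in (10, 100, 500):
    print(f"{n} more iterations: {format_duration(avg_iter_seconds * n)}")
```

With this mean, the sketch prints 3m 15s, 32m 33s, and 2h 42m 48s for 10, 100, and 500 more iterations, matching the estimates logged after iteration 55.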
[2025-11-13 08:23:24,580][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:23:29,891][__main__][INFO] - Number of regex retries in iteration 55: 0
[2025-11-13 08:23:29,892][__main__][INFO] - agents played in iteration 55 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:23:30,334][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:23:30,367][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:23:30,400][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:23:30,433][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:23:30,433][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:23:30,434][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:23:31,146][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:23:31,440][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:23:31,766][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:23:32,089][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:23:32,412][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:23:32,735][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:23:33,057][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:23:33,382][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:23:33,706][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:23:34,029][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:23:34,352][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:23:34,676][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:23:35,000][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:23:35,325][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:23:35,649][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:23:35,972][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:23:36,297][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:23:36,622][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:23:36,948][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:23:37,273][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:23:37,601][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:23:37,927][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:23:38,249][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:23:38,571][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:23:38,897][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:23:39,222][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:23:39,546][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:23:39,868][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:23:40,192][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:23:40,515][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:23:40,843][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:23:41,170][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:23:41,492][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:23:42,222][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:23:42,962][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:23:42,963][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:23:42,965][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:23:44,001][__main__][INFO] - Iteration 56 took 19s (27.35% Gen, 67.31% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 51m 14s. Estimated total time: 16h 11m 9s. Time estimates for 10 more iterations: 3m 14s, 100 more iterations: 32m 22s, 500 more iterations: 2h 41m 51s.
[2025-11-13 08:23:44,004][__main__][INFO] - Starting iteration 56.
[2025-11-13 08:23:44,007][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1.
[2025-11-13 08:23:44,007][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:23:49,269][__main__][INFO] - Number of regex retries in iteration 56: 0
[2025-11-13 08:23:49,270][__main__][INFO] - agents played in iteration 56 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:23:49,730][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:23:49,766][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:23:49,799][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:23:49,832][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:23:49,832][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:23:49,833][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:23:50,565][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:23:50,858][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:23:51,185][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:23:51,507][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:23:51,831][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:23:52,153][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:23:52,476][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:23:52,800][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:23:53,124][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:23:53,449][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:23:53,773][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:23:54,096][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:23:54,418][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:23:54,741][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:23:55,064][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:23:55,386][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:23:55,710][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:23:56,033][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:23:56,358][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:23:56,682][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:23:57,005][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:23:57,327][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:23:57,651][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:23:57,973][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:23:58,296][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:23:58,620][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:23:58,944][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:23:59,267][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:23:59,589][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:23:59,913][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:24:00,235][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:24:00,559][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:24:00,884][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:24:01,613][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:24:02,348][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:24:02,349][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:24:02,351][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:24:03,322][__main__][INFO] - Iteration 57 took 19s (27.24% Gen, 67.72% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 45m 33s. Estimated total time: 16h 5m 47s. Time estimates for 10 more iterations: 3m 13s, 100 more iterations: 32m 11s, 500 more iterations: 2h 40m 57s.
[2025-11-13 08:24:03,324][__main__][INFO] - Starting iteration 57.
[2025-11-13 08:24:03,327][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1.
[2025-11-13 08:24:03,327][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:24:08,684][__main__][INFO] - Number of regex retries in iteration 57: 0
[2025-11-13 08:24:08,684][__main__][INFO] - agents played in iteration 57 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:24:09,161][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:24:09,194][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:24:09,227][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:24:09,261][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:24:09,262][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:24:09,262][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:24:09,968][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:24:10,262][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:24:10,589][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:24:10,914][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:24:11,236][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:24:11,565][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:24:11,892][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:24:12,216][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:24:12,543][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:24:12,867][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:24:13,190][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:24:13,513][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:24:13,836][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:24:14,158][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:24:14,482][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:24:14,807][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:24:15,134][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:24:15,460][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:24:15,782][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:24:16,106][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:24:16,428][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:24:16,751][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:24:17,073][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:24:17,396][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:24:17,719][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:24:18,044][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:24:18,366][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:24:18,688][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:24:19,010][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:24:19,333][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:24:19,657][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:24:19,981][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:24:20,304][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:24:21,015][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:24:21,747][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:24:21,748][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:24:21,750][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:24:22,740][__main__][INFO] - Iteration 58 took 19s (27.59% Gen, 67.30% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 50m 9s. Estimated total time: 16h 10m 42s. Time estimates for 10 more iterations: 3m 14s, 100 more iterations: 32m 21s, 500 more iterations: 2h 41m 47s.
[2025-11-13 08:24:22,743][__main__][INFO] - Starting iteration 58.
[2025-11-13 08:24:22,747][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1.
[2025-11-13 08:24:22,747][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:24:28,017][__main__][INFO] - Number of regex retries in iteration 58: 0
[2025-11-13 08:24:28,017][__main__][INFO] - agents played in iteration 58 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:24:28,462][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:24:28,496][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:24:28,529][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:24:28,563][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:24:28,563][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:24:28,564][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:24:29,278][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:24:29,572][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:24:29,896][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:24:30,218][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:24:30,542][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:24:30,864][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:24:31,188][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:24:31,511][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:24:31,835][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:24:32,158][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:24:32,481][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:24:32,805][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:24:33,127][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:24:33,452][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:24:33,774][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:24:34,098][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:24:34,420][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:24:34,744][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:24:35,066][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:24:35,388][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:24:35,712][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:24:36,036][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:24:36,360][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:24:36,688][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:24:37,011][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:24:37,336][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:24:37,659][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:24:37,983][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:24:38,307][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:24:38,630][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:24:38,952][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:24:39,274][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:24:39,597][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:24:40,305][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:24:41,027][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:24:41,028][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:24:41,030][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:24:42,045][__main__][INFO] - Iteration 59 took 19s (27.31% Gen, 67.43% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 44m 3s. Estimated total time: 16h 4m 56s. Time estimates for 10 more iterations: 3m 12s, 100 more iterations: 32m 9s, 500 more iterations: 2h 40m 49s.
[2025-11-13 08:24:42,047][__main__][INFO] - Starting iteration 59.
[2025-11-13 08:24:42,051][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1.
[2025-11-13 08:24:42,051][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:24:47,328][__main__][INFO] - Number of regex retries in iteration 59: 0
[2025-11-13 08:24:47,328][__main__][INFO] - agents played in iteration 59 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:24:47,768][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:24:47,804][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:24:47,837][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:24:47,870][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:24:47,871][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:24:47,872][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:24:48,592][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:24:48,886][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:24:49,210][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:24:49,533][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:24:49,855][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:24:50,177][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:24:50,501][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:24:50,823][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:24:51,149][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:24:51,471][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:24:51,798][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:24:52,122][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:24:52,444][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:24:52,772][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:24:53,100][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:24:53,422][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:24:53,744][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:24:54,071][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:24:54,398][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:24:54,722][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:24:55,044][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:24:55,368][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:24:55,697][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:24:56,022][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:24:56,344][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:24:56,667][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:24:56,990][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:24:57,313][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:24:57,635][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:24:57,959][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:24:58,282][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:24:58,605][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:24:58,929][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:24:59,641][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:25:00,377][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:25:00,379][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:25:00,380][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:25:01,362][__main__][INFO] - Iteration 60 took 19s (27.33% Gen, 67.59% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 44m 23s. Estimated total time: 16h 5m 36s. Time estimates for 10 more iterations: 3m 13s, 100 more iterations: 32m 11s, 500 more iterations: 2h 40m 56s.
[2025-11-13 08:25:01,364][__main__][INFO] - Starting iteration 60.
[2025-11-13 08:25:01,367][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1.
[2025-11-13 08:25:01,368][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:25:06,764][__main__][INFO] - Number of regex retries in iteration 60: 0 [2025-11-13 08:25:06,765][__main__][INFO] - agents played in iteration 60 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 08:25:07,210][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:25:07,244][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:25:07,278][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:25:07,311][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:25:07,311][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:25:07,311][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:25:08,041][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:25:08,335][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:25:08,659][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:25:08,981][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:25:09,303][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:25:09,626][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:25:09,950][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:25:10,272][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:25:10,596][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:25:10,919][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:25:11,243][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:25:11,567][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:25:11,889][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:25:12,216][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:25:12,541][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:25:12,863][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:25:13,185][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:25:13,510][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:25:13,834][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:25:14,158][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:25:14,481][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:25:14,805][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:25:15,129][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:25:15,451][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:25:15,774][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:25:16,098][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:25:16,425][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:25:16,747][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:25:17,070][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:25:17,392][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:25:17,715][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:25:18,038][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:25:18,362][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:25:19,077][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:25:19,809][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:25:19,810][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:25:19,812][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:25:21,807][__main__][INFO] - Iteration 61 took 20s (26.40% Gen, 63.83% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 40m 29s. Estimated total time: 17h 2m 2s. Time estimates for 10 more iterations: 3m 24s, 100 more iterations: 34m 4s, 500 more iterations: 2h 50m 20s.
[2025-11-13 08:25:21,809][__main__][INFO] - Starting iteration 61.
[2025-11-13 08:25:21,813][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1.
[2025-11-13 08:25:21,813][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:25:27,631][__main__][INFO] - Number of regex retries in iteration 61: 0
[2025-11-13 08:25:27,632][__main__][INFO] - agents played in iteration 61 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:25:28,073][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:25:28,107][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:25:28,141][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:25:28,175][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:25:28,175][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:25:28,176][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:25:28,897][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:25:29,191][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:25:29,517][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:25:29,840][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:25:30,161][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:25:30,486][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:25:30,810][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:25:31,135][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:25:31,458][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:25:31,784][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:25:32,107][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:25:32,433][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:25:32,756][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:25:33,077][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:25:33,400][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:25:33,726][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:25:34,052][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:25:34,374][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:25:34,698][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:25:35,021][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:25:35,345][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:25:35,668][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:25:35,990][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:25:36,313][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:25:36,635][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:25:36,959][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:25:37,281][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:25:37,605][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:25:37,927][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:25:38,251][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:25:38,573][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:25:38,898][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:25:39,222][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:25:39,932][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:25:40,664][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:25:40,666][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:25:40,668][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:25:41,676][__main__][INFO] - Iteration 62 took 19s (29.29% Gen, 65.62% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 11m 21s. Estimated total time: 16h 33m 13s. Time estimates for 10 more iterations: 3m 18s, 100 more iterations: 33m 6s, 500 more iterations: 2h 45m 32s.
[2025-11-13 08:25:41,679][__main__][INFO] - Starting iteration 62.
[2025-11-13 08:25:41,682][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1.
[2025-11-13 08:25:41,682][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:25:47,366][__main__][INFO] - Number of regex retries in iteration 62: 0
[2025-11-13 08:25:47,366][__main__][INFO] - agents played in iteration 62 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:25:47,821][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:25:47,854][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:25:47,887][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:25:47,920][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:25:47,920][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:25:47,921][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:25:48,658][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:25:48,952][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:25:49,275][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:25:49,599][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:25:49,924][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:25:50,246][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:25:50,569][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:25:50,893][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:25:51,217][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:25:51,541][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:25:51,864][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:25:52,187][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:25:52,511][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:25:52,835][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:25:53,158][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:25:53,482][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:25:53,806][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:25:54,130][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:25:54,454][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:25:54,777][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:25:55,102][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:25:55,425][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:25:55,746][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:25:56,069][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:25:56,392][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:25:56,716][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:25:57,040][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:25:57,365][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:25:57,688][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:25:58,013][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:25:58,335][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:25:58,659][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:25:58,983][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:25:59,706][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:26:00,460][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:26:00,462][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:26:00,464][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:26:01,458][__main__][INFO] - Iteration 63 took 19s (28.74% Gen, 66.23% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 6m 38s. Estimated total time: 16h 28m 50s. Time estimates for 10 more iterations: 3m 17s, 100 more iterations: 32m 57s, 500 more iterations: 2h 44m 48s.
[2025-11-13 08:26:01,461][__main__][INFO] - Starting iteration 63.
[2025-11-13 08:26:01,464][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1.
[2025-11-13 08:26:01,465][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:26:07,035][__main__][INFO] - Number of regex retries in iteration 63: 0
[2025-11-13 08:26:07,036][__main__][INFO] - agents played in iteration 63 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:26:07,475][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:26:07,511][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:26:07,544][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:26:07,577][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:26:07,577][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:26:07,578][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:26:08,294][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:26:08,586][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:26:08,910][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:26:09,232][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:26:09,554][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:26:09,876][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:26:10,200][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:26:10,524][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:26:10,848][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:26:11,173][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:26:11,497][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:26:11,820][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:26:12,143][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:26:12,465][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:26:12,787][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:26:13,111][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:26:13,433][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:26:13,755][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:26:14,076][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:26:14,398][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:26:14,722][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:26:15,046][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:26:15,369][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:26:15,692][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:26:16,015][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:26:16,338][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:26:16,662][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:26:16,984][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:26:17,305][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:26:17,627][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:26:17,951][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:26:18,274][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:26:18,597][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:26:19,306][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:26:20,045][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:26:20,046][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:26:20,048][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:26:21,185][__main__][INFO] - Iteration 64 took 19s (28.25% Gen, 65.98% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 3m 31s. Estimated total time: 16h 26m 3s. Time estimates for 10 more iterations: 3m 17s, 100 more iterations: 32m 52s, 500 more iterations: 2h 44m 20s.
[2025-11-13 08:26:21,187][__main__][INFO] - Starting iteration 64.
[2025-11-13 08:26:21,190][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1.
[2025-11-13 08:26:21,190][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:26:26,846][__main__][INFO] - Number of regex retries in iteration 64: 0
[2025-11-13 08:26:26,847][__main__][INFO] - agents played in iteration 64 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:26:27,290][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:26:27,325][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:26:27,358][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:26:27,391][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:26:27,391][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:26:27,391][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:26:28,108][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:26:28,401][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:26:28,724][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:26:29,048][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:26:29,370][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:26:29,694][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:26:30,017][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:26:30,341][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:26:30,665][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:26:30,991][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:26:31,314][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:26:31,636][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:26:31,961][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:26:32,283][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:26:32,605][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:26:32,926][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:26:33,248][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:26:33,570][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:26:33,892][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:26:34,215][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:26:34,537][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:26:34,860][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:26:35,185][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:26:35,508][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:26:35,832][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:26:36,157][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:26:36,480][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:26:36,806][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:26:37,128][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:26:37,450][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:26:37,772][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:26:38,094][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:26:38,416][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:26:39,129][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:26:39,862][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:26:39,864][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:26:39,865][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:26:40,882][__main__][INFO] - Iteration 65 took 19s (28.72% Gen, 66.11% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 1m 48s. Estimated total time: 16h 24m 40s. Time estimates for 10 more iterations: 3m 16s, 100 more iterations: 32m 49s, 500 more iterations: 2h 44m 6s.
[2025-11-13 08:26:40,885][__main__][INFO] - Starting iteration 65.
[2025-11-13 08:26:40,888][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1.
[2025-11-13 08:26:40,889][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:26:46,565][__main__][INFO] - Number of regex retries in iteration 65: 0
[2025-11-13 08:26:46,565][__main__][INFO] - agents played in iteration 65 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:26:47,011][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:26:47,044][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:26:47,078][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:26:47,111][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:26:47,111][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:26:47,112][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:26:47,836][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:26:48,130][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:26:48,453][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:26:48,781][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:26:49,106][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:26:49,430][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:26:49,754][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:26:50,076][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:26:50,400][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:26:50,722][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:26:51,047][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:26:51,369][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:26:51,694][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:26:52,018][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:26:52,341][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:26:52,663][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:26:52,985][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:26:53,308][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:26:53,632][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:26:53,957][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:26:54,281][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:26:54,603][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:26:54,927][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:26:55,249][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:26:55,572][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:26:55,894][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:26:56,218][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:26:56,541][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:26:56,866][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:26:57,189][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:26:57,514][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:26:57,837][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:26:58,162][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:26:58,873][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:26:59,614][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:26:59,615][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:26:59,617][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:27:00,660][__main__][INFO] - Iteration 66 took 19s (28.71% Gen, 66.01% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 5m 26s. Estimated total time: 16h 28m 37s. Time estimates for 10 more iterations: 3m 17s, 100 more iterations: 32m 57s, 500 more iterations: 2h 44m 46s.
[2025-11-13 08:27:00,662][__main__][INFO] - Starting iteration 66.
[2025-11-13 08:27:00,665][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1.
[2025-11-13 08:27:00,666][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:27:06,335][__main__][INFO] - Number of regex retries in iteration 66: 0
[2025-11-13 08:27:06,336][__main__][INFO] - agents played in iteration 66 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:27:06,795][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:27:06,827][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:27:06,860][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:27:06,893][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:27:06,894][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:27:06,895][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:27:07,608][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:27:07,901][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:27:08,224][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:27:08,549][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:27:08,878][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:27:09,206][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:27:09,535][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:27:09,862][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:27:10,191][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:27:10,515][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:27:10,838][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:27:11,162][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:27:11,487][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:27:11,812][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:27:12,136][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:27:12,460][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:27:12,785][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:27:13,109][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:27:13,433][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:27:13,759][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:27:14,086][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:27:14,411][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:27:14,736][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:27:15,058][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:27:15,380][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:27:15,704][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:27:16,027][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:27:16,351][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:27:16,677][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:27:17,002][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:27:17,324][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:27:17,651][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:27:17,975][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:27:18,705][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:27:19,445][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:27:19,446][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:27:19,448][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:27:20,422][__main__][INFO] - Iteration 67 took 19s (28.69% Gen, 66.37% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 4m 21s. Estimated total time: 16h 27m 52s. Time estimates for 10 more iterations: 3m 17s, 100 more iterations: 32m 55s, 500 more iterations: 2h 44m 38s.
[2025-11-13 08:27:20,425][__main__][INFO] - Starting iteration 67.
[2025-11-13 08:27:20,428][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1.
[2025-11-13 08:27:20,428][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:27:26,000][__main__][INFO] - Number of regex retries in iteration 67: 0
[2025-11-13 08:27:26,001][__main__][INFO] - agents played in iteration 67 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:27:26,453][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:27:26,488][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:27:26,521][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:27:26,554][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:27:26,554][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:27:26,554][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:27:27,274][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:27:27,568][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:27:27,892][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:27:28,213][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:27:28,536][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:27:28,862][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:27:29,186][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:27:29,511][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:27:29,836][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:27:30,159][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:27:30,482][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:27:30,804][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:27:31,132][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:27:31,454][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:27:31,778][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:27:32,104][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:27:32,428][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:27:32,752][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:27:33,075][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:27:33,399][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:27:33,722][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:27:34,044][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:27:34,367][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:27:34,690][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:27:35,013][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:27:35,338][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:27:35,662][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:27:35,985][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:27:36,309][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:27:36,636][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:27:36,963][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:27:37,288][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:27:37,610][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:27:38,325][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:27:39,071][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:27:39,072][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:27:39,074][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:27:40,115][__main__][INFO] - Iteration 68 took 19s (28.30% Gen, 66.40% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 0m 34s. Estimated total time: 16h 24m 25s. Time estimates for 10 more iterations: 3m 16s, 100 more iterations: 32m 48s, 500 more iterations: 2h 44m 4s.
[2025-11-13 08:27:40,118][__main__][INFO] - Starting iteration 68.
[2025-11-13 08:27:40,121][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1.
[2025-11-13 08:27:40,121][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:27:45,691][__main__][INFO] - Number of regex retries in iteration 68: 0
[2025-11-13 08:27:45,691][__main__][INFO] - agents played in iteration 68 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:27:46,157][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:27:46,193][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:27:46,226][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:27:46,259][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:27:46,260][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:27:46,260][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:27:46,971][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:27:47,263][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:27:47,586][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:27:47,908][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:27:48,232][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:27:48,556][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:27:48,879][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:27:49,201][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:27:49,525][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:27:49,849][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:27:50,172][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:27:50,494][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:27:50,816][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:27:51,139][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:27:51,463][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:27:51,787][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:27:52,109][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:27:52,431][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:27:52,754][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:27:53,075][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:27:53,399][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:27:53,722][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:27:54,045][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:27:54,367][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:27:54,689][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:27:55,012][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:27:55,336][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:27:55,659][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:27:55,982][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:27:56,306][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:27:56,628][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:27:56,954][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:27:57,276][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:27:57,988][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:27:58,738][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:27:58,740][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:27:58,741][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:27:59,730][__main__][INFO] - Iteration 69 took 19s (28.41% Gen, 66.54% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 56m 21s. Estimated total time: 16h 20m 32s. Time estimates for 10 more iterations: 3m 16s, 100 more iterations: 32m 41s, 500 more iterations: 2h 43m 25s.
[2025-11-13 08:27:59,733][__main__][INFO] - Starting iteration 69.
[2025-11-13 08:27:59,736][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1.
[2025-11-13 08:27:59,737][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:28:05,364][__main__][INFO] - Number of regex retries in iteration 69: 0
[2025-11-13 08:28:05,365][__main__][INFO] - agents played in iteration 69 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:28:05,816][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:28:05,852][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:28:05,885][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:28:05,918][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:28:05,918][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:28:05,919][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:28:06,632][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:28:06,927][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:28:07,249][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:28:07,570][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:28:07,892][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:28:08,216][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:28:08,537][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:28:08,860][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:28:09,184][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:28:09,507][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:28:09,831][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:28:10,152][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:28:10,476][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:28:10,800][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:28:11,129][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:28:11,453][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:28:11,777][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:28:12,101][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:28:12,423][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:28:12,746][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:28:13,069][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:28:13,392][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:28:13,713][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:28:14,037][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:28:14,362][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:28:14,685][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:28:15,009][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:28:15,331][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:28:15,654][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:28:15,976][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:28:16,302][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:28:16,626][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:28:16,949][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:28:17,662][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:28:18,404][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:28:18,405][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:28:18,407][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:28:19,409][__main__][INFO] - Iteration 70 took 19s (28.61% Gen, 66.29% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 59m 10s. Estimated total time: 16h 23m 40s. Time estimates for 10 more iterations: 3m 16s, 100 more iterations: 32m 47s, 500 more iterations: 2h 43m 56s.
[2025-11-13 08:28:19,411][__main__][INFO] - Starting iteration 70.
[2025-11-13 08:28:19,414][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1.
[2025-11-13 08:28:19,415][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:28:25,054][__main__][INFO] - Number of regex retries in iteration 70: 0
[2025-11-13 08:28:25,054][__main__][INFO] - agents played in iteration 70 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:28:25,495][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:28:25,529][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:28:25,562][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:28:25,595][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:28:25,595][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:28:25,596][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:28:26,315][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:28:26,611][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:28:26,933][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:28:27,255][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:28:27,580][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:28:27,903][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:28:28,224][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:28:28,547][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:28:28,870][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:28:29,192][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:28:29,515][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:28:29,843][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:28:30,171][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:28:30,495][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:28:30,820][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:28:31,143][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:28:31,468][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:28:31,791][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:28:32,113][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:28:32,438][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:28:32,765][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:28:33,087][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:28:33,413][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:28:33,735][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:28:34,060][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:28:34,383][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:28:34,707][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:28:35,030][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:28:35,352][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:28:35,675][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:28:35,997][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:28:36,321][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:28:36,646][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:28:37,356][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:28:38,105][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:28:38,107][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:28:38,109][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:28:40,063][__main__][INFO] - Iteration 71 took 20s (27.31% Gen, 63.22% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 47m 38s. Estimated total time: 17h 12m 29s. Time estimates for 10 more iterations: 3m 26s, 100 more iterations: 34m 24s, 500 more iterations: 2h 52m 4s.
[2025-11-13 08:28:40,065][__main__][INFO] - Starting iteration 71.
[2025-11-13 08:28:40,068][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1.
[2025-11-13 08:28:40,069][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:28:46,168][__main__][INFO] - Number of regex retries in iteration 71: 0
[2025-11-13 08:28:46,168][__main__][INFO] - agents played in iteration 71 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:28:46,614][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:28:46,647][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:28:46,680][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:28:46,713][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:28:46,713][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:28:46,714][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:28:47,427][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:28:47,721][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:28:48,047][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:28:48,369][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:28:48,691][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:28:49,013][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:28:49,336][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:28:49,659][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:28:49,982][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:28:50,306][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:28:50,627][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:28:50,950][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:28:51,272][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:28:51,594][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:28:51,917][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:28:52,240][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:28:52,564][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:28:52,886][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:28:53,211][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:28:53,538][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:28:53,866][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:28:54,190][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:28:54,512][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:28:54,837][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:28:55,162][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:28:55,485][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:28:55,807][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:28:56,130][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:28:56,453][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:28:56,775][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:28:57,101][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:28:57,423][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:28:57,745][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:28:58,456][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:28:59,193][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:28:59,194][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:28:59,196][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:29:00,221][__main__][INFO] - Iteration 72 took 20s (30.26% Gen, 64.65% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 22m 27s. Estimated total time: 16h 47m 38s. Time estimates for 10 more iterations: 3m 21s, 100 more iterations: 33m 35s, 500 more iterations: 2h 47m 56s.
[2025-11-13 08:29:00,223][__main__][INFO] - Starting iteration 72.
[2025-11-13 08:29:00,226][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1.
[2025-11-13 08:29:00,226][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:29:06,169][__main__][INFO] - Number of regex retries in iteration 72: 0
[2025-11-13 08:29:06,170][__main__][INFO] - agents played in iteration 72 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:29:06,608][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:29:06,645][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:29:06,678][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:29:06,711][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:29:06,712][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:29:06,712][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:29:07,430][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:29:07,724][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:29:08,049][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:29:08,371][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:29:08,701][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:29:09,024][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:29:09,347][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:29:09,675][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:29:10,001][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:29:10,327][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:29:10,652][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:29:10,977][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:29:11,302][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:29:11,624][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:29:11,949][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:29:12,273][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:29:12,600][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:29:12,923][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:29:13,248][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:29:13,573][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:29:13,896][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:29:14,219][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:29:14,542][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:29:14,867][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:29:15,191][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:29:15,516][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:29:15,839][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:29:16,163][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:29:16,487][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:29:16,813][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:29:17,135][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:29:17,460][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:29:17,783][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:29:18,496][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:29:19,240][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:29:19,241][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:29:19,243][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:29:20,305][__main__][INFO] - Iteration 73 took 20s (29.60% Gen, 65.10% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 18m 29s. Estimated total time: 16h 44m 0s. Time estimates for 10 more iterations: 3m 20s, 100 more iterations: 33m 28s, 500 more iterations: 2h 47m 20s.
[2025-11-13 08:29:20,307][__main__][INFO] - Starting iteration 73.
[2025-11-13 08:29:20,311][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1.
[2025-11-13 08:29:20,311][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:29:26,174][__main__][INFO] - Number of regex retries in iteration 73: 0
[2025-11-13 08:29:26,174][__main__][INFO] - agents played in iteration 73 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:29:26,616][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:29:26,649][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:29:26,683][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:29:26,717][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:29:26,717][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:29:26,717][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:29:27,430][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:29:27,724][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:29:28,049][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:29:28,373][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:29:28,695][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:29:29,021][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:29:29,343][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:29:29,665][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:29:29,988][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:29:30,310][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:29:30,633][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:29:30,955][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:29:31,278][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:29:31,601][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:29:31,924][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:29:32,245][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:29:32,569][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:29:32,892][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:29:33,215][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:29:33,538][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:29:33,862][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:29:34,185][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:29:34,508][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:29:34,830][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:29:35,153][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:29:35,475][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:29:35,801][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:29:36,123][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:29:36,446][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:29:36,771][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:29:37,092][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:29:37,417][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:29:37,742][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:29:38,454][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:29:39,180][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:29:39,182][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:29:39,183][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:29:40,151][__main__][INFO] - Iteration 74 took 19s (29.55% Gen, 65.57% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 6m 11s. Estimated total time: 16h 32m 2s. Time estimates for 10 more iterations: 3m 18s, 100 more iterations: 33m 4s, 500 more iterations: 2h 45m 20s.
[2025-11-13 08:29:40,153][__main__][INFO] - Starting iteration 74.
[2025-11-13 08:29:40,156][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1.
[2025-11-13 08:29:40,157][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:29:46,080][__main__][INFO] - Number of regex retries in iteration 74: 0
[2025-11-13 08:29:46,081][__main__][INFO] - agents played in iteration 74 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:29:46,524][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:29:46,559][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:29:46,593][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:29:46,627][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:29:46,628][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:29:46,628][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:29:47,345][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:29:47,641][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:29:47,966][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:29:48,291][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:29:48,613][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:29:48,935][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:29:49,258][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:29:49,580][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:29:49,903][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:29:50,227][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:29:50,549][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:29:50,871][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:29:51,194][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:29:51,517][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:29:51,841][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:29:52,164][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:29:52,486][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:29:52,809][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:29:53,133][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:29:53,456][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:29:53,778][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:29:54,101][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:29:54,423][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:29:54,746][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:29:55,069][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:29:55,392][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:29:55,715][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:29:56,039][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:29:56,361][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:29:56,684][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:29:57,006][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:29:57,328][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:29:57,651][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:29:58,352][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:29:59,131][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:29:59,132][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:29:59,137][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:30:00,379][__main__][INFO] - Iteration 75 took 20s (29.29% Gen, 64.56% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 24m 59s. Estimated total time: 16h 51m 11s. Time estimates for 10 more iterations: 3m 22s, 100 more iterations: 33m 42s, 500 more iterations: 2h 48m 31s.
[2025-11-13 08:30:00,381][__main__][INFO] - Starting iteration 75.
[2025-11-13 08:30:00,385][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1.
[2025-11-13 08:30:00,386][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:30:06,243][__main__][INFO] - Number of regex retries in iteration 75: 0
[2025-11-13 08:30:06,243][__main__][INFO] - agents played in iteration 75 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:30:06,686][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:30:06,720][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:30:06,753][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:30:06,786][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:30:06,787][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:30:06,787][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:30:07,513][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:30:07,807][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:30:08,134][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:30:08,459][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:30:08,782][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:30:09,106][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:30:09,429][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:30:09,751][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:30:10,076][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:30:10,401][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:30:10,728][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:30:11,052][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:30:11,379][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:30:11,704][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:30:12,027][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:30:12,348][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:30:12,671][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:30:12,994][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:30:13,317][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:30:13,640][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:30:13,964][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:30:14,286][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:30:14,610][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:30:14,938][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:30:15,262][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:30:15,584][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:30:15,906][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:30:16,230][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:30:16,553][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:30:16,876][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:30:17,202][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:30:17,526][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:30:17,848][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:30:18,563][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:30:19,309][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:30:19,310][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:30:19,312][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:30:20,349][__main__][INFO] - Iteration 76 took 19s (29.34% Gen, 65.46% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 11m 43s. Estimated total time: 16h 38m 14s. Time estimates for 10 more iterations: 3m 19s, 100 more iterations: 33m 16s, 500 more iterations: 2h 46m 22s.
[2025-11-13 08:30:20,351][__main__][INFO] - Starting iteration 76.
[2025-11-13 08:30:20,354][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1.
[2025-11-13 08:30:20,354][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:30:26,237][__main__][INFO] - Number of regex retries in iteration 76: 0
[2025-11-13 08:30:26,237][__main__][INFO] - agents played in iteration 76 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:30:26,675][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:30:26,708][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:30:26,741][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:30:26,774][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:30:26,775][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:30:26,775][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:30:27,512][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:30:27,806][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:30:28,130][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:30:28,456][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:30:28,779][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:30:29,103][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:30:29,425][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:30:29,749][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:30:30,074][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:30:30,396][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:30:30,722][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:30:31,046][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:30:31,370][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:30:31,693][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:30:32,017][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:30:32,343][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:30:32,667][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:30:32,993][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:30:33,316][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:30:33,638][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:30:33,961][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:30:34,284][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:30:34,606][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:30:34,930][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:30:35,255][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:30:35,580][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:30:35,905][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:30:36,227][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:30:36,556][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:30:36,882][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:30:37,208][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:30:37,535][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:30:37,860][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:30:38,583][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:30:39,338][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:30:39,339][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:30:39,341][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:30:40,351][__main__][INFO] - Iteration 77 took 19s (29.42% Gen, 65.52% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 13m 4s. Estimated total time: 16h 39m 55s. Time estimates for 10 more iterations: 3m 19s, 100 more iterations: 33m 19s, 500 more iterations: 2h 46m 39s.
[2025-11-13 08:30:40,353][__main__][INFO] - Starting iteration 77.
[2025-11-13 08:30:40,356][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1.
[2025-11-13 08:30:40,357][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:30:46,329][__main__][INFO] - Number of regex retries in iteration 77: 0
[2025-11-13 08:30:46,330][__main__][INFO] - agents played in iteration 77 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:30:46,772][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:30:46,809][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:30:46,844][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:30:46,877][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:30:46,878][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:30:46,878][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:30:47,592][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:30:47,886][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:30:48,209][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:30:48,532][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:30:48,853][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:30:49,177][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:30:49,500][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:30:49,822][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:30:50,145][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:30:50,467][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:30:50,790][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:30:51,112][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:30:51,437][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:30:51,760][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:30:52,082][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:30:52,405][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:30:52,729][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:30:53,052][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:30:53,373][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:30:53,698][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:30:54,023][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:30:54,351][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:30:54,677][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:30:55,001][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:30:55,325][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:30:55,648][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:30:55,971][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:30:56,295][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:30:56,620][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:30:56,943][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:30:57,266][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:30:57,588][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:30:57,911][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:30:58,620][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:30:59,356][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:30:59,357][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:30:59,359][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:31:00,437][__main__][INFO] - Iteration 78 took 20s (29.74% Gen, 64.88% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 16m 51s. Estimated total time: 16h 44m 2s. Time estimates for 10 more iterations: 3m 20s, 100 more iterations: 33m 28s, 500 more iterations: 2h 47m 20s.
[2025-11-13 08:31:00,439][__main__][INFO] - Starting iteration 78.
[2025-11-13 08:31:00,442][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1.
[2025-11-13 08:31:00,442][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:31:04,815][mllm.models.large_language_model_local][WARNING] - Response user Last round, the other agent played .
did not match regex: (|), retry 1/1
[2025-11-13 08:31:07,368][__main__][INFO] - Number of regex retries in iteration 78: 1
[2025-11-13 08:31:07,369][__main__][INFO] - agents played in iteration 78 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:31:07,847][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:31:07,880][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:31:07,913][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:31:07,945][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:31:07,946][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:31:07,947][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:31:08,602][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:31:08,895][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:31:09,221][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:31:09,544][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:31:09,867][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:31:10,190][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:31:10,511][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:31:10,834][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:31:11,156][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:31:11,479][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:31:11,803][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:31:12,125][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:31:12,448][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:31:12,770][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:31:13,094][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:31:13,417][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:31:13,740][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:31:14,062][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:31:14,384][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:31:14,707][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:31:15,033][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:31:15,359][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:31:15,683][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:31:16,008][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:31:16,333][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:31:16,657][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:31:16,979][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:31:17,304][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:31:17,627][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:31:17,950][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:31:18,275][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:31:18,600][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:31:18,925][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:31:19,631][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:31:20,360][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:31:20,362][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:31:20,363][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:31:21,229][__main__][INFO] - Iteration 79 took 20s (33.32% Gen, 62.51% Train). Generation: 6s, Training: 12s. Estimated remaining time: 16h 51m 53s. Estimated total time: 17h 19m 25s. Time estimates for 10 more iterations: 3m 27s, 100 more iterations: 34m 38s, 500 more iterations: 2h 53m 14s.
[2025-11-13 08:31:21,231][__main__][INFO] - Starting iteration 79.
[2025-11-13 08:31:21,234][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1.
[2025-11-13 08:31:21,235][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:31:26,884][__main__][INFO] - Number of regex retries in iteration 79: 0
[2025-11-13 08:31:26,885][__main__][INFO] - agents played in iteration 79 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:31:27,368][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:31:27,402][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:31:27,436][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:31:27,470][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:31:27,470][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:31:27,471][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:31:28,135][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:31:28,429][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:31:28,754][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:31:29,076][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:31:29,401][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:31:29,724][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:31:30,046][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:31:30,368][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:31:30,691][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:31:31,019][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:31:31,343][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:31:31,669][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:31:31,995][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:31:32,320][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:31:32,642][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:31:32,967][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:31:33,296][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:31:33,619][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:31:33,950][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:31:34,275][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:31:34,598][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:31:34,920][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:31:35,247][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:31:35,571][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:31:35,894][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:31:36,216][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:31:36,541][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:31:36,869][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:31:37,198][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:31:37,527][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:31:37,852][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:31:38,177][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:31:38,503][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:31:39,210][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:31:39,946][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:31:39,947][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:31:39,949][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:31:40,936][__main__][INFO] - Iteration 80 took 19s (28.68% Gen, 66.30% Train). Generation: 5s, Training: 13s. Estimated remaining time: 15h 57m 17s. Estimated total time: 16h 25m 9s. Time estimates for 10 more iterations: 3m 17s, 100 more iterations: 32m 50s, 500 more iterations: 2h 44m 11s.
[2025-11-13 08:31:40,938][__main__][INFO] - Starting iteration 80.
[2025-11-13 08:31:40,941][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1.
[2025-11-13 08:31:40,941][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:31:46,668][__main__][INFO] - Number of regex retries in iteration 80: 0
[2025-11-13 08:31:46,668][__main__][INFO] - agents played in iteration 80 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:31:47,130][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:31:47,164][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:31:47,198][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:31:47,231][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:31:47,231][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:31:47,232][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:31:47,912][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:31:48,205][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:31:48,530][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:31:48,855][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:31:49,178][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:31:49,504][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:31:49,828][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:31:50,153][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:31:50,476][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:31:50,805][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:31:51,130][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:31:51,456][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:31:51,784][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:31:52,110][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:31:52,435][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:31:52,761][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:31:53,084][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:31:53,406][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:31:53,730][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:31:54,052][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:31:54,374][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:31:54,697][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:31:55,022][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:31:55,344][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:31:55,668][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:31:55,993][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:31:56,315][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:31:56,638][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:31:56,962][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:31:57,286][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:31:57,609][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:31:57,934][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:31:58,263][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:31:58,975][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:31:59,698][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:31:59,700][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:31:59,702][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:32:01,685][__main__][INFO] - Iteration 81 took 20s (27.60% Gen, 62.83% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 49m 3s. Estimated total time: 17h 17m 15s. Time estimates for 10 more iterations: 3m 27s, 100 more iterations: 34m 34s, 500 more iterations: 2h 52m 52s.
[2025-11-13 08:32:01,687][__main__][INFO] - Starting iteration 81.
[2025-11-13 08:32:01,690][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1.
[2025-11-13 08:32:01,690][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:32:07,850][__main__][INFO] - Number of regex retries in iteration 81: 0
[2025-11-13 08:32:07,851][__main__][INFO] - agents played in iteration 81 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:32:08,304][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:32:08,337][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:32:08,371][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:32:08,405][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:32:08,406][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:32:08,406][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:32:09,119][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:32:09,413][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:32:09,741][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:32:10,066][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:32:10,391][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:32:10,718][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:32:11,043][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:32:11,366][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:32:11,691][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:32:12,021][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:32:12,345][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:32:12,670][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:32:12,995][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:32:13,322][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:32:13,647][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:32:13,973][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:32:14,299][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:32:14,628][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:32:14,954][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:32:15,276][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:32:15,600][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:32:15,924][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:32:16,248][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:32:16,570][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:32:16,896][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:32:17,220][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:32:17,547][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:32:17,870][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:32:18,194][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:32:18,517][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:32:18,840][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:32:19,167][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:32:19,496][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:32:20,207][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:32:20,935][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:32:20,937][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:32:20,938][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:32:21,872][__main__][INFO] - Iteration 82 took 20s (30.52% Gen, 64.85% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 20m 35s. Estimated total time: 16h 49m 8s. Time estimates for 10 more iterations: 3m 21s, 100 more iterations: 33m 38s, 500 more iterations: 2h 48m 11s.
[2025-11-13 08:32:21,873][__main__][INFO] - Starting iteration 82.
[2025-11-13 08:32:21,876][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1.
[2025-11-13 08:32:21,876][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:32:26,069][mllm.models.large_language_model_local][WARNING] - Response user Last round, the other agent played .
did not match regex: (|), retry 1/1
[2025-11-13 08:32:28,585][__main__][INFO] - Number of regex retries in iteration 82: 1
[2025-11-13 08:32:28,585][__main__][INFO] - agents played in iteration 82 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:32:29,047][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:32:29,081][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:32:29,114][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:32:29,148][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:32:29,149][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:32:29,149][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:32:29,811][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:32:30,105][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:32:30,431][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:32:30,753][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:32:31,077][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:32:31,403][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:32:31,727][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:32:32,053][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:32:32,376][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:32:32,699][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:32:33,025][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:32:33,346][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:32:33,668][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:32:33,993][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:32:34,316][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:32:34,645][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:32:34,969][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:32:35,293][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:32:35,616][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:32:35,940][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:32:36,264][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:32:36,586][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:32:36,909][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:32:37,233][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:32:37,558][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:32:37,882][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:32:38,207][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:32:38,531][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:32:38,854][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:32:39,184][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:32:39,507][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:32:39,833][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:32:40,163][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:32:40,871][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:32:41,589][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:32:41,591][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:32:41,592][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:32:42,492][__main__][INFO] - Iteration 83 took 20s (32.54% Gen, 63.09% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 41m 57s. Estimated total time: 17h 10m 50s. Time estimates for 10 more iterations: 3m 26s, 100 more iterations: 34m 21s, 500 more iterations: 2h 51m 48s.
[2025-11-13 08:32:42,494][__main__][INFO] - Starting iteration 83.
[2025-11-13 08:32:42,497][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1.
[2025-11-13 08:32:42,497][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:32:48,476][__main__][INFO] - Number of regex retries in iteration 83: 0
[2025-11-13 08:32:48,476][__main__][INFO] - agents played in iteration 83 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:32:48,933][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:32:48,966][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:32:49,000][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:32:49,034][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:32:49,034][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:32:49,034][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:32:49,720][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:32:50,014][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:32:50,343][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:32:50,667][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:32:50,992][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:32:51,320][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:32:51,645][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:32:51,970][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:32:52,294][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:32:52,616][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:32:52,938][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:32:53,263][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:32:53,588][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:32:53,910][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:32:54,235][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:32:54,556][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:32:54,881][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:32:55,203][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:32:55,527][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:32:55,853][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:32:56,176][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:32:56,497][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:32:56,821][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:32:57,143][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:32:57,465][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:32:57,787][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:32:58,113][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:32:58,435][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:32:58,760][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:32:59,082][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:32:59,406][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:32:59,731][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:33:00,057][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:33:00,767][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:33:01,501][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:33:01,503][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:33:01,504][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:33:02,463][__main__][INFO] - Iteration 84 took 19s (29.95% Gen, 65.25% Train). Generation: 5s, Training: 13s. Estimated remaining time: 16h 9m 8s. Estimated total time: 16h 38m 21s. Time estimates for 10 more iterations: 3m 19s, 100 more iterations: 33m 16s, 500 more iterations: 2h 46m 23s.
[2025-11-13 08:33:02,465][__main__][INFO] - Starting iteration 84.
[2025-11-13 08:33:02,468][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1.
[2025-11-13 08:33:02,469][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:33:08,508][__main__][INFO] - Number of regex retries in iteration 84: 0
[2025-11-13 08:33:08,509][__main__][INFO] - agents played in iteration 84 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:33:08,963][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:33:08,996][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:33:09,031][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:33:09,064][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:33:09,065][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:33:09,065][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:33:09,762][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:33:10,055][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:33:10,378][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:33:10,702][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:33:11,026][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:33:11,349][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:33:11,672][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:33:11,994][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:33:12,321][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:33:12,645][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:33:12,968][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:33:13,295][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:33:13,622][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:33:13,951][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:33:14,277][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:33:14,602][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:33:14,928][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:33:15,249][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:33:15,572][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:33:15,894][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:33:16,217][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:33:16,540][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:33:16,864][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:33:17,186][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:33:17,509][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:33:17,833][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:33:18,156][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:33:18,480][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:33:18,806][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:33:19,128][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:33:19,453][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:33:19,777][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:33:20,102][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:33:20,815][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:33:21,541][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:33:21,542][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:33:21,544][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:33:22,495][__main__][INFO] - Iteration 85 took 20s (30.16% Gen, 65.08% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 11m 52s. Estimated total time: 16h 41m 25s. Time estimates for 10 more iterations: 3m 20s, 100 more iterations: 33m 22s, 500 more iterations: 2h 46m 54s.
[2025-11-13 08:33:22,498][__main__][INFO] - Starting iteration 85.
[2025-11-13 08:33:22,500][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1.
[2025-11-13 08:33:22,501][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:33:28,531][__main__][INFO] - Number of regex retries in iteration 85: 0
[2025-11-13 08:33:28,532][__main__][INFO] - agents played in iteration 85 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:33:28,993][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:33:29,029][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:33:29,063][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:33:29,096][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:33:29,097][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:33:29,097][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:33:29,817][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:33:30,112][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:33:30,437][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:33:30,760][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:33:31,083][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:33:31,407][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:33:31,730][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:33:32,054][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:33:32,377][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:33:32,700][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:33:33,025][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:33:33,355][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:33:33,678][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:33:34,004][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:33:34,328][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:33:34,652][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:33:34,974][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:33:35,297][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:33:35,622][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:33:35,944][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:33:36,269][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:33:36,593][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:33:36,917][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:33:37,241][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:33:37,565][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:33:37,888][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:33:38,213][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:33:38,535][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:33:38,858][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:33:39,182][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:33:39,505][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:33:39,829][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:33:40,155][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:33:40,862][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:33:41,583][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:33:41,584][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:33:41,586][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:33:42,567][__main__][INFO] - Iteration 86 took 20s (30.05% Gen, 65.05% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 13m 28s. Estimated total time: 16h 43m 21s. Time estimates for 10 more iterations: 3m 20s, 100 more iterations: 33m 26s, 500 more iterations: 2h 47m 13s.
[2025-11-13 08:33:42,569][__main__][INFO] - Starting iteration 86.
[2025-11-13 08:33:42,573][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1.
[2025-11-13 08:33:42,574][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:33:48,713][__main__][INFO] - Number of regex retries in iteration 86: 0
[2025-11-13 08:33:48,714][__main__][INFO] - agents played in iteration 86 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:33:49,170][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:33:49,204][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:33:49,237][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:33:49,271][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:33:49,271][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:33:49,271][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:33:49,989][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:33:50,282][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:33:50,606][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:33:50,928][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:33:51,253][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:33:51,577][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:33:51,903][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:33:52,226][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:33:52,551][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:33:52,875][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:33:53,200][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:33:53,525][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:33:53,847][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:33:54,170][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:33:54,494][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:33:54,818][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:33:55,142][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:33:55,465][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:33:55,787][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:33:56,114][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:33:56,438][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:33:56,763][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:33:57,088][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:33:57,415][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:33:57,742][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:33:58,065][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:33:58,393][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:33:58,718][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:33:59,047][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:33:59,375][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:33:59,704][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:34:00,028][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:34:00,353][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:34:01,069][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:34:01,800][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:34:01,802][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:34:01,803][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:34:02,779][__main__][INFO] - Iteration 87 took 20s (30.39% Gen, 64.78% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 20m 6s. Estimated total time: 16h 50m 19s. Time estimates for 10 more iterations: 3m 22s, 100 more iterations: 33m 40s, 500 more iterations: 2h 48m 23s.
[2025-11-13 08:34:02,782][__main__][INFO] - Starting iteration 87.
[2025-11-13 08:34:02,785][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1.
[2025-11-13 08:34:02,786][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:34:08,992][__main__][INFO] - Number of regex retries in iteration 87: 0
[2025-11-13 08:34:08,993][__main__][INFO] - agents played in iteration 87 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:34:09,452][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:34:09,488][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:34:09,521][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:34:09,555][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:34:09,555][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:34:09,556][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:34:10,277][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:34:10,570][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:34:10,892][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:34:11,216][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:34:11,538][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:34:11,861][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:34:12,185][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:34:12,507][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:34:12,830][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:34:13,153][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:34:13,475][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:34:13,801][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:34:14,125][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:34:14,448][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:34:14,771][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:34:15,094][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:34:15,416][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:34:15,740][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:34:16,064][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:34:16,387][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:34:16,712][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:34:17,035][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:34:17,360][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:34:17,683][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:34:18,005][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:34:18,326][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:34:18,648][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:34:18,972][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:34:19,297][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:34:19,623][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:34:19,947][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:34:20,274][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:34:20,603][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:34:21,315][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:34:22,049][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:34:22,051][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:34:22,053][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:34:23,009][__main__][INFO] - Iteration 88 took 20s (30.69% Gen, 64.58% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 20m 39s. Estimated total time: 16h 51m 13s. Time estimates for 10 more iterations: 3m 22s, 100 more iterations: 33m 42s, 500 more iterations: 2h 48m 32s. [2025-11-13 08:34:23,012][__main__][INFO] - Starting iteration 88. [2025-11-13 08:34:23,014][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2025-11-13 08:34:23,015][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:34:29,147][__main__][INFO] - Number of regex retries in iteration 88: 0 [2025-11-13 08:34:29,148][__main__][INFO] - agents played in iteration 88 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 08:34:29,610][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:34:29,643][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:34:29,676][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:34:29,710][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:34:29,710][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:34:29,711][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:34:30,438][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:34:30,734][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:34:31,058][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:34:31,382][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:34:31,705][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:34:32,031][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:34:32,354][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:34:32,679][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:34:33,002][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:34:33,325][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:34:33,650][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:34:33,973][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:34:34,296][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:34:34,621][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:34:34,945][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:34:35,268][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:34:35,594][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:34:35,917][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:34:36,242][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:34:36,568][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:34:36,897][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:34:37,224][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:34:37,551][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:34:37,875][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:34:38,198][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:34:38,520][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:34:38,843][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:34:39,167][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:34:39,494][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:34:39,823][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:34:40,152][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:34:40,473][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:34:40,798][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:34:41,507][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:34:42,239][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:34:42,241][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:34:42,243][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:34:43,212][__main__][INFO] - Iteration 89 took 20s (30.36% Gen, 64.83% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 19m 2s. Estimated total time: 16h 49m 56s. Time estimates for 10 more iterations: 3m 21s, 100 more iterations: 33m 39s, 500 more iterations: 2h 48m 19s. [2025-11-13 08:34:43,215][__main__][INFO] - Starting iteration 89. [2025-11-13 08:34:43,219][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2025-11-13 08:34:43,219][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:34:49,403][__main__][INFO] - Number of regex retries in iteration 89: 0 [2025-11-13 08:34:49,404][__main__][INFO] - agents played in iteration 89 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 08:34:49,858][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:34:49,894][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:34:49,928][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:34:49,962][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:34:49,962][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:34:49,963][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:34:50,680][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:34:50,974][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:34:51,297][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:34:51,619][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:34:51,943][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:34:52,265][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:34:52,593][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:34:52,916][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:34:53,237][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:34:53,559][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:34:53,888][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:34:54,214][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:34:54,538][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:34:54,862][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:34:55,186][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:34:55,510][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:34:55,833][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:34:56,157][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:34:56,484][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:34:56,813][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:34:57,140][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:34:57,465][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:34:57,794][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:34:58,122][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:34:58,445][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:34:58,768][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:34:59,090][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:34:59,416][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:34:59,744][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:35:00,067][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:35:00,393][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:35:00,717][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:35:01,041][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:35:01,762][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:35:02,509][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:35:02,511][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:35:02,513][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:35:03,505][__main__][INFO] - Iteration 90 took 20s (30.49% Gen, 64.62% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 23m 7s. Estimated total time: 16h 54m 22s. Time estimates for 10 more iterations: 3m 22s, 100 more iterations: 33m 48s, 500 more iterations: 2h 49m 3s. [2025-11-13 08:35:03,507][__main__][INFO] - Starting iteration 90. [2025-11-13 08:35:03,510][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2025-11-13 08:35:03,511][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:35:09,732][__main__][INFO] - Number of regex retries in iteration 90: 0 [2025-11-13 08:35:09,733][__main__][INFO] - agents played in iteration 90 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 08:35:10,191][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:35:10,227][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:35:10,261][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:35:10,294][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:35:10,295][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:35:10,295][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:35:11,014][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:35:11,308][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:35:11,630][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:35:11,952][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:35:12,274][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:35:12,597][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:35:12,921][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:35:13,243][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:35:13,567][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:35:13,890][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:35:14,213][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:35:14,535][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:35:14,857][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:35:15,181][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:35:15,505][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:35:15,829][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:35:16,151][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:35:16,475][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:35:16,798][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:35:17,122][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:35:17,448][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:35:17,772][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:35:18,096][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:35:18,422][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:35:18,747][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:35:19,070][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:35:19,392][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:35:19,716][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:35:20,040][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:35:20,364][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:35:20,687][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:35:21,010][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:35:21,334][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:35:22,045][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:35:22,790][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:35:22,791][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:35:22,793][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:35:24,798][__main__][INFO] - Iteration 91 took 21s (29.23% Gen, 61.35% Train). Generation: 6s, Training: 13s. Estimated remaining time: 17h 12m 49s. Estimated total time: 17h 44m 24s. Time estimates for 10 more iterations: 3m 32s, 100 more iterations: 35m 28s, 500 more iterations: 2h 57m 24s. [2025-11-13 08:35:24,800][__main__][INFO] - Starting iteration 91. [2025-11-13 08:35:24,803][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2025-11-13 08:35:24,804][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:35:31,412][__main__][INFO] - Number of regex retries in iteration 91: 0 [2025-11-13 08:35:31,413][__main__][INFO] - agents played in iteration 91 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 08:35:31,879][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:35:31,912][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:35:31,945][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:35:31,978][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:35:31,979][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:35:31,979][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:35:32,710][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:35:33,004][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:35:33,332][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:35:33,654][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:35:33,977][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:35:34,304][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:35:34,629][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:35:34,952][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:35:35,275][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:35:35,598][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:35:35,921][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:35:36,246][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:35:36,570][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:35:36,894][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:35:37,217][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:35:37,543][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:35:37,866][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:35:38,188][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:35:38,511][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:35:38,835][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:35:39,161][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:35:39,484][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:35:39,809][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:35:40,132][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:35:40,453][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:35:40,775][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:35:41,104][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:35:41,430][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:35:41,752][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:35:42,075][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:35:42,397][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:35:42,721][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:35:43,043][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:35:43,774][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:35:44,514][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:35:44,515][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:35:44,517][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:35:45,725][__main__][INFO] - Iteration 92 took 20s (31.59% Gen, 62.63% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 54m 10s. Estimated total time: 17h 26m 7s. Time estimates for 10 more iterations: 3m 29s, 100 more iterations: 34m 52s, 500 more iterations: 2h 54m 21s. [2025-11-13 08:35:45,727][__main__][INFO] - Starting iteration 92. [2025-11-13 08:35:45,730][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2025-11-13 08:35:45,730][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:35:52,051][__main__][INFO] - Number of regex retries in iteration 92: 0 [2025-11-13 08:35:52,051][__main__][INFO] - agents played in iteration 92 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 08:35:52,505][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:35:52,542][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:35:52,575][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:35:52,609][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:35:52,610][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:35:52,610][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:35:53,314][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:35:53,608][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:35:53,930][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:35:54,255][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:35:54,579][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:35:54,908][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:35:55,230][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:35:55,553][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:35:55,875][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:35:56,201][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:35:56,525][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:35:56,847][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:35:57,169][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:35:57,493][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:35:57,815][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:35:58,143][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:35:58,469][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:35:58,792][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:35:59,113][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:35:59,435][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 08:35:59,759][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:36:00,082][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:36:00,404][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:36:00,727][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:36:01,049][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:36:01,371][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:36:01,698][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:36:02,021][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:36:02,345][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:36:02,668][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:36:02,990][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:36:03,312][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:36:03,636][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:36:04,349][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:36:05,088][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:36:05,089][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:36:05,091][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:36:06,110][__main__][INFO] - Iteration 93 took 20s (31.01% Gen, 63.98% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 26m 47s. Estimated total time: 16h 59m 4s. Time estimates for 10 more iterations: 3m 23s, 100 more iterations: 33m 58s, 500 more iterations: 2h 49m 50s. [2025-11-13 08:36:06,112][__main__][INFO] - Starting iteration 93. [2025-11-13 08:36:06,116][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2025-11-13 08:36:06,116][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:36:12,526][__main__][INFO] - Number of regex retries in iteration 93: 0
[2025-11-13 08:36:12,527][__main__][INFO] - agents played in iteration 93 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:36:12,995][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:36:13,029][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:36:13,062][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:36:13,096][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:36:13,097][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:36:13,098][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:36:13,818][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:36:14,113][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:36:14,440][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:36:14,764][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:36:15,087][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:36:15,411][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:36:15,734][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:36:16,056][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:36:16,382][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:36:16,703][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:36:17,025][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:36:17,349][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:36:17,671][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:36:17,995][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:36:18,318][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:36:18,643][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:36:18,966][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:36:19,288][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:36:19,618][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:36:19,943][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:36:20,271][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:36:20,595][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:36:20,921][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:36:21,250][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:36:21,576][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:36:21,902][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:36:22,228][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:36:22,554][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:36:22,877][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:36:23,202][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:36:23,527][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:36:23,849][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:36:24,171][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:36:24,886][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:36:25,621][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:36:25,622][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:36:25,624][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:36:26,627][__main__][INFO] - Iteration 94 took 20s (31.25% Gen, 63.85% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 32m 58s. Estimated total time: 17h 5m 36s. Time estimates for 10 more iterations: 3m 25s, 100 more iterations: 34m 11s, 500 more iterations: 2h 50m 56s.
[2025-11-13 08:36:26,629][__main__][INFO] - Starting iteration 94.
[2025-11-13 08:36:26,632][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1.
[2025-11-13 08:36:26,633][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:36:33,150][__main__][INFO] - Number of regex retries in iteration 94: 0
[2025-11-13 08:36:33,151][__main__][INFO] - agents played in iteration 94 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:36:33,619][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:36:33,656][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:36:33,689][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:36:33,723][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:36:33,723][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:36:33,724][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:36:34,436][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:36:34,727][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:36:35,050][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:36:35,372][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:36:35,694][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:36:36,018][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:36:36,342][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:36:36,665][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:36:36,992][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:36:37,314][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:36:37,637][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:36:37,958][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:36:38,281][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:36:38,604][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:36:38,927][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:36:39,249][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:36:39,572][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:36:39,896][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:36:40,219][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:36:40,542][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:36:40,864][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:36:41,186][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:36:41,511][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:36:41,835][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:36:42,157][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:36:42,480][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:36:42,803][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:36:43,125][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:36:43,449][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:36:43,771][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:36:44,095][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:36:44,418][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:36:44,744][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:36:45,456][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:36:46,181][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:36:46,182][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:36:46,184][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:36:47,225][__main__][INFO] - Iteration 95 took 20s (31.65% Gen, 63.29% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 36m 43s. Estimated total time: 17h 9m 41s. Time estimates for 10 more iterations: 3m 25s, 100 more iterations: 34m 19s, 500 more iterations: 2h 51m 36s.
[2025-11-13 08:36:47,227][__main__][INFO] - Starting iteration 95.
[2025-11-13 08:36:47,230][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1.
[2025-11-13 08:36:47,231][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:36:53,597][__main__][INFO] - Number of regex retries in iteration 95: 0
[2025-11-13 08:36:53,598][__main__][INFO] - agents played in iteration 95 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:36:54,060][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:36:54,093][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:36:54,129][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:36:54,162][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:36:54,163][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:36:54,163][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:36:54,875][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:36:55,168][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:36:55,494][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:36:55,821][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:36:56,146][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:36:56,471][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:36:56,795][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:36:57,117][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:36:57,440][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:36:57,765][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:36:58,093][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:36:58,415][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:36:58,738][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:36:59,062][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:36:59,385][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:36:59,710][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:37:00,031][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:37:00,356][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:37:00,680][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:37:01,003][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:37:01,324][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:37:01,647][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:37:01,969][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:37:02,292][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:37:02,616][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:37:02,940][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:37:03,263][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:37:03,585][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:37:03,914][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:37:04,235][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:37:04,558][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:37:04,886][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:37:05,213][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:37:05,921][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:37:06,649][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:37:06,650][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:37:06,652][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:37:07,740][__main__][INFO] - Iteration 96 took 20s (31.04% Gen, 63.65% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 32m 15s. Estimated total time: 17h 5m 33s. Time estimates for 10 more iterations: 3m 25s, 100 more iterations: 34m 11s, 500 more iterations: 2h 50m 55s.
[2025-11-13 08:37:07,742][__main__][INFO] - Starting iteration 96.
[2025-11-13 08:37:07,745][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1.
[2025-11-13 08:37:07,745][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:37:14,072][__main__][INFO] - Number of regex retries in iteration 96: 0
[2025-11-13 08:37:14,073][__main__][INFO] - agents played in iteration 96 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:37:14,533][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:37:14,570][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:37:14,604][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:37:14,638][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:37:14,638][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:37:14,639][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:37:15,370][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:37:15,665][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:37:15,990][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:37:16,313][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:37:16,638][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:37:16,964][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:37:17,292][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:37:17,617][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:37:17,943][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:37:18,267][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:37:18,589][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:37:18,912][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:37:19,236][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:37:19,557][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:37:19,880][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:37:20,206][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:37:20,529][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:37:20,851][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:37:21,173][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:37:21,502][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:37:21,824][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:37:22,147][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:37:22,472][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:37:22,796][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:37:23,120][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:37:23,443][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:37:23,770][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:37:24,101][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:37:24,426][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:37:24,748][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:37:25,072][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:37:25,398][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:37:25,724][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:37:26,433][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:37:27,153][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:37:27,154][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:37:27,156][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:37:28,101][__main__][INFO] - Iteration 97 took 20s (31.08% Gen, 64.27% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 24m 12s. Estimated total time: 16h 57m 51s. Time estimates for 10 more iterations: 3m 23s, 100 more iterations: 33m 55s, 500 more iterations: 2h 49m 38s.
[2025-11-13 08:37:28,103][__main__][INFO] - Starting iteration 97.
[2025-11-13 08:37:28,106][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1.
[2025-11-13 08:37:28,106][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:37:34,483][__main__][INFO] - Number of regex retries in iteration 97: 0
[2025-11-13 08:37:34,484][__main__][INFO] - agents played in iteration 97 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:37:34,952][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:37:34,987][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:37:35,020][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:37:35,054][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:37:35,054][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:37:35,054][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:37:35,783][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:37:36,079][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:37:36,405][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:37:36,729][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:37:37,054][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:37:37,382][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:37:37,704][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:37:38,028][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:37:38,352][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:37:38,673][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:37:38,996][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:37:39,323][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:37:39,646][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:37:39,967][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:37:40,291][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:37:40,615][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:37:40,937][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:37:41,263][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:37:41,588][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:37:41,910][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:37:42,238][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:37:42,562][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:37:42,889][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:37:43,213][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:37:43,537][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:37:43,862][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:37:44,186][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:37:44,511][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:37:44,840][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:37:45,169][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:37:45,495][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:37:45,819][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:37:46,147][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:37:46,868][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:37:47,606][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:37:47,607][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:37:47,608][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:37:48,666][__main__][INFO] - Iteration 98 took 20s (31.02% Gen, 63.83% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 34m 4s. Estimated total time: 17h 8m 4s. Time estimates for 10 more iterations: 3m 25s, 100 more iterations: 34m 16s, 500 more iterations: 2h 51m 20s.
[2025-11-13 08:37:48,669][__main__][INFO] - Starting iteration 98.
[2025-11-13 08:37:48,671][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1.
[2025-11-13 08:37:48,672][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:37:55,149][__main__][INFO] - Number of regex retries in iteration 98: 0
[2025-11-13 08:37:55,150][__main__][INFO] - agents played in iteration 98 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:37:55,606][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:37:55,640][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:37:55,674][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:37:55,708][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:37:55,708][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:37:55,709][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:37:56,431][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:37:56,725][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:37:57,048][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:37:57,373][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:37:57,698][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:37:58,022][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:37:58,345][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:37:58,668][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:37:58,992][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:37:59,313][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:37:59,637][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:37:59,960][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:38:00,283][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:38:00,606][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:38:00,929][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:38:01,252][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:38:01,575][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:38:01,897][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:38:02,219][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:38:02,542][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:38:02,865][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:38:03,188][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:38:03,511][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:38:03,834][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:38:04,156][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:38:04,478][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:38:04,802][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:38:05,124][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:38:05,448][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:38:05,770][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:38:06,094][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:38:06,417][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:38:06,739][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:38:07,448][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:38:08,176][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:38:08,177][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:38:08,179][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:38:09,347][__main__][INFO] - Iteration 99 took 20s (31.33% Gen, 63.02% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 39m 28s. Estimated total time: 17h 13m 48s. Time estimates for 10 more iterations: 3m 26s, 100 more iterations: 34m 27s, 500 more iterations: 2h 52m 18s. [2025-11-13 08:38:09,349][__main__][INFO] - Starting iteration 99. [2025-11-13 08:38:09,351][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2025-11-13 08:38:09,352][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:38:15,695][__main__][INFO] - Number of regex retries in iteration 99: 0 [2025-11-13 08:38:15,696][__main__][INFO] - agents played in iteration 99 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 08:38:16,162][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:38:16,197][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:38:16,230][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:38:16,263][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:38:16,264][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:38:16,264][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:38:16,973][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:38:17,265][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:38:17,589][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:38:17,912][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:38:18,234][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:38:18,557][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:38:18,879][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:38:19,201][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:38:19,525][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:38:19,849][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:38:20,173][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:38:20,495][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:38:20,818][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:38:21,140][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:38:21,464][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:38:21,787][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:38:22,110][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:38:22,439][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:38:22,763][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:38:23,091][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:38:23,413][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:38:23,736][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:38:24,058][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:38:24,385][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:38:24,710][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:38:25,035][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:38:25,361][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:38:25,685][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:38:26,009][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:38:26,332][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:38:26,656][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:38:26,981][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:38:27,307][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:38:28,015][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:38:28,741][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:38:28,743][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:38:28,744][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:38:29,722][__main__][INFO] - Iteration 100 took 20s (31.14% Gen, 64.05% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 23m 53s. Estimated total time: 16h 58m 34s. Time estimates for 10 more iterations: 3m 23s, 100 more iterations: 33m 57s, 500 more iterations: 2h 49m 45s. [2025-11-13 08:38:29,724][__main__][INFO] - Starting iteration 100. [2025-11-13 08:38:29,727][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2025-11-13 08:38:29,728][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:38:36,149][__main__][INFO] - Number of regex retries in iteration 100: 0 [2025-11-13 08:38:36,150][__main__][INFO] - agents played in iteration 100 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 08:38:36,608][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:38:36,641][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:38:36,675][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:38:36,709][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:38:36,710][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:38:36,710][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:38:37,416][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:38:37,710][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:38:38,034][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:38:38,357][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:38:38,681][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:38:39,005][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:38:39,329][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:38:39,652][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:38:39,976][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:38:40,304][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:38:40,625][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:38:40,949][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:38:41,272][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:38:41,598][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:38:41,927][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:38:42,249][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:38:42,576][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:38:42,899][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:38:43,223][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:38:43,546][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:38:43,869][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:38:44,193][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:38:44,517][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:38:44,839][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:38:45,162][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:38:45,486][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:38:45,807][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:38:46,131][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:38:46,454][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:38:46,780][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:38:47,103][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:38:47,428][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:38:47,751][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:38:48,470][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:38:49,210][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:38:49,211][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:38:49,213][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:38:51,423][__main__][INFO] - Iteration 101 took 21s (29.60% Gen, 60.21% Train). Generation: 6s, Training: 13s. Estimated remaining time: 17h 29m 46s. Estimated total time: 18h 4m 48s. Time estimates for 10 more iterations: 3m 36s, 100 more iterations: 36m 9s, 500 more iterations: 3h 0m 48s. [2025-11-13 08:38:51,425][__main__][INFO] - Starting iteration 101. [2025-11-13 08:38:51,427][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2025-11-13 08:38:51,428][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:38:58,191][__main__][INFO] - Number of regex retries in iteration 101: 0 [2025-11-13 08:38:58,192][__main__][INFO] - agents played in iteration 101 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 08:38:58,660][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:38:58,694][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:38:58,728][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:38:58,762][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:38:58,762][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:38:58,763][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:38:59,460][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:38:59,752][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:39:00,076][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:39:00,401][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:39:00,724][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:39:01,046][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:39:01,370][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:39:01,692][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:39:02,014][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:39:02,339][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:39:02,661][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:39:02,983][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:39:03,305][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:39:03,629][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:39:03,954][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:39:04,278][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:39:04,601][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:39:04,924][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:39:05,250][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:39:05,574][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:39:05,896][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:39:06,217][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:39:06,539][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:39:06,861][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:39:07,183][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:39:07,506][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:39:07,829][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:39:08,150][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:39:08,472][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:39:08,796][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:39:09,118][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:39:09,440][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:39:09,762][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:39:10,467][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:39:11,210][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:39:11,212][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:39:11,213][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:39:12,179][__main__][INFO] - Iteration 102 took 20s (32.59% Gen, 62.75% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 42m 13s. Estimated total time: 17h 17m 36s. Time estimates for 10 more iterations: 3m 27s, 100 more iterations: 34m 35s, 500 more iterations: 2h 52m 56s. [2025-11-13 08:39:12,181][__main__][INFO] - Starting iteration 102. [2025-11-13 08:39:12,185][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2025-11-13 08:39:12,186][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:39:18,880][__main__][INFO] - Number of regex retries in iteration 102: 0 [2025-11-13 08:39:18,880][__main__][INFO] - agents played in iteration 102 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 08:39:19,336][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:39:19,369][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:39:19,403][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:39:19,438][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:39:19,438][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:39:19,439][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:39:20,156][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:39:20,450][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:39:20,774][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:39:21,097][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:39:21,420][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:39:21,742][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:39:22,070][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:39:22,395][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:39:22,715][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:39:23,040][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:39:23,361][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:39:23,686][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:39:24,008][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:39:24,331][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:39:24,654][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:39:24,977][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:39:25,301][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:39:25,627][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:39:25,950][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:39:26,272][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:39:26,596][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:39:26,917][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:39:27,244][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:39:27,571][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:39:27,896][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:39:28,224][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:39:28,551][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:39:28,878][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:39:29,204][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:39:29,528][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:39:29,856][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:39:30,181][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:39:30,504][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:39:31,224][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:39:31,964][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:39:31,966][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:39:31,968][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:39:32,963][__main__][INFO] - Iteration 103 took 20s (32.21% Gen, 62.98% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 43m 15s. Estimated total time: 17h 18m 59s. Time estimates for 10 more iterations: 3m 27s, 100 more iterations: 34m 37s, 500 more iterations: 2h 53m 9s. [2025-11-13 08:39:32,965][__main__][INFO] - Starting iteration 103. [2025-11-13 08:39:32,968][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2025-11-13 08:39:32,969][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:39:39,711][__main__][INFO] - Number of regex retries in iteration 103: 0 [2025-11-13 08:39:39,711][__main__][INFO] - agents played in iteration 103 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 08:39:40,168][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:39:40,202][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:39:40,236][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:39:40,270][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:39:40,271][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:39:40,271][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:39:41,001][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:39:41,294][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:39:41,620][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:39:41,948][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:39:42,280][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:39:42,610][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:39:42,938][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:39:43,265][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:39:43,593][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:39:43,919][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:39:44,245][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:39:44,567][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:39:44,894][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:39:45,218][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:39:45,542][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:39:45,865][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:39:46,187][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:39:46,509][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:39:46,832][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:39:47,155][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:39:47,477][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:39:47,800][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:39:48,123][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:39:48,450][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:39:48,775][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:39:49,099][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:39:49,422][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:39:49,746][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:39:50,071][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:39:50,399][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:39:50,722][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:39:51,045][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:39:51,374][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:39:52,103][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:39:52,840][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:39:52,841][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:39:52,843][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:39:53,834][__main__][INFO] - Iteration 104 took 20s (32.31% Gen, 62.93% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 47m 14s. Estimated total time: 17h 23m 18s. Time estimates for 10 more iterations: 3m 28s, 100 more iterations: 34m 46s, 500 more iterations: 2h 53m 53s. [2025-11-13 08:39:53,836][__main__][INFO] - Starting iteration 104. [2025-11-13 08:39:53,838][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2025-11-13 08:39:53,839][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:40:00,477][__main__][INFO] - Number of regex retries in iteration 104: 0
[2025-11-13 08:40:00,478][__main__][INFO] - agents played in iteration 104 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:40:00,948][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:40:00,982][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:40:01,016][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:40:01,049][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:40:01,050][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:40:01,050][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
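The ΔVRAM / Current / Block Peak figures in these records all appear to be byte counts expressed as percentages of total device memory, sampled around a measured block. The sketch below shows only that arithmetic, with the byte counts passed in so it runs anywhere; in the real trainer they would presumably come from something like `torch.cuda.memory_allocated` and `torch.cuda.max_memory_allocated`. The helper name `vram_report` is illustrative:

```python
def vram_report(alloc_before, alloc_after, block_peak, total):
    """Express a timed block's memory footprint as percentages of device VRAM.

    alloc_before / alloc_after: bytes allocated at block entry and exit.
    block_peak: peak bytes allocated while inside the block.
    total: total device VRAM in bytes.
    All inputs are hypothetical; only the percentage arithmetic is shown.
    """
    delta_pct = 100.0 * (alloc_after - alloc_before) / total
    current_pct = 100.0 * alloc_after / total
    peak_pct = 100.0 * block_peak / total
    return (
        f"ΔVRAM % (total): {delta_pct:.2f}%, "
        f"Current % of VRAM taken: {current_pct:.2f}%, "
        f"Block Peak % of device VRAM: {peak_pct:.2f}%"
    )
```

This explains why repeated "Get advantages" blocks report ΔVRAM of 0.00% while "Current" stays constant: allocation at entry and exit is identical once buffers are warm, so only the block peak differs from the steady state.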
[2025-11-13 08:40:01,769][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:40:02,063][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:40:02,385][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:40:02,708][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:40:03,031][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:40:03,353][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:40:03,677][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:40:04,002][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:40:04,325][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:40:04,650][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:40:04,975][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:40:05,296][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:40:05,620][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:40:05,944][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:40:06,267][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:40:06,590][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:40:06,912][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:40:07,235][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:40:07,557][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:40:07,882][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:40:08,205][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:40:08,528][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:40:08,849][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:40:09,172][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:40:09,494][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:40:09,816][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:40:10,139][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:40:10,463][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:40:10,792][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:40:11,114][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:40:11,439][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:40:11,763][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:40:12,087][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:40:12,809][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:40:13,542][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:40:13,543][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:40:13,545][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:40:14,646][__main__][INFO] - Iteration 105 took 20s (31.90% Gen, 62.80% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 44m 1s. Estimated total time: 17h 20m 26s. Time estimates for 10 more iterations: 3m 28s, 100 more iterations: 34m 40s, 500 more iterations: 2h 53m 24s.
[2025-11-13 08:40:14,649][__main__][INFO] - Starting iteration 105.
[2025-11-13 08:40:14,652][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1.
[2025-11-13 08:40:14,653][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:40:21,071][__main__][INFO] - Number of regex retries in iteration 105: 0
[2025-11-13 08:40:21,072][__main__][INFO] - agents played in iteration 105 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:40:21,532][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:40:21,569][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:40:21,603][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:40:21,637][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:40:21,638][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:40:21,638][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:40:22,358][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:40:22,653][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:40:22,976][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:40:23,300][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:40:23,628][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:40:23,953][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:40:24,278][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:40:24,601][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:40:24,924][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:40:25,245][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:40:25,569][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:40:25,893][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:40:26,215][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:40:26,542][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:40:26,868][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:40:27,198][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:40:27,520][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:40:27,843][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:40:28,165][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:40:28,487][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:40:28,810][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:40:29,138][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:40:29,459][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:40:29,783][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:40:30,106][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:40:30,428][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:40:30,750][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:40:31,074][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:40:31,403][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:40:31,726][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:40:32,050][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:40:32,373][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:40:32,697][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:40:33,403][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:40:34,145][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:40:34,147][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:40:34,148][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:40:35,141][__main__][INFO] - Iteration 106 took 20s (31.33% Gen, 63.82% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 27m 43s. Estimated total time: 17h 4m 29s. Time estimates for 10 more iterations: 3m 24s, 100 more iterations: 34m 8s, 500 more iterations: 2h 50m 44s.
[2025-11-13 08:40:35,143][__main__][INFO] - Starting iteration 106.
[2025-11-13 08:40:35,147][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1.
[2025-11-13 08:40:35,147][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:40:41,902][__main__][INFO] - Number of regex retries in iteration 106: 0
[2025-11-13 08:40:41,902][__main__][INFO] - agents played in iteration 106 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:40:42,361][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:40:42,397][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:40:42,432][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:40:42,466][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:40:42,466][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:40:42,467][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:40:43,208][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:40:43,503][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:40:43,825][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:40:44,149][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:40:44,472][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:40:44,795][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:40:45,120][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:40:45,443][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:40:45,766][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:40:46,091][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:40:46,414][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:40:46,738][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:40:47,061][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:40:47,384][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:40:47,709][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:40:48,033][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:40:48,356][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:40:48,680][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:40:49,002][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:40:49,325][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:40:49,648][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:40:49,970][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:40:50,293][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:40:50,616][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:40:50,938][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:40:51,264][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:40:51,586][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:40:51,909][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:40:52,233][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:40:52,555][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:40:52,877][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:40:53,201][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:40:53,524][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:40:54,256][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:40:55,003][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:40:55,005][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:40:55,007][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:40:56,055][__main__][INFO] - Iteration 107 took 20s (32.31% Gen, 62.67% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 48m 21s. Estimated total time: 17h 25m 27s. Time estimates for 10 more iterations: 3m 29s, 100 more iterations: 34m 50s, 500 more iterations: 2h 54m 14s.
[2025-11-13 08:40:56,057][__main__][INFO] - Starting iteration 107.
[2025-11-13 08:40:56,061][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1.
[2025-11-13 08:40:56,062][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:41:02,759][__main__][INFO] - Number of regex retries in iteration 107: 0
[2025-11-13 08:41:02,760][__main__][INFO] - agents played in iteration 107 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:41:03,219][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:41:03,252][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:41:03,285][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:41:03,318][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:41:03,319][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:41:03,319][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:41:04,052][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:41:04,345][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:41:04,670][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:41:04,993][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:41:05,317][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:41:05,641][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:41:05,966][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:41:06,292][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:41:06,620][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:41:06,945][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:41:07,269][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:41:07,592][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:41:07,914][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:41:08,239][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:41:08,561][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:41:08,886][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:41:09,209][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:41:09,533][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:41:09,855][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:41:10,178][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:41:10,501][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:41:10,824][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:41:11,146][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:41:11,469][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:41:11,792][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:41:12,115][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:41:12,437][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:41:12,760][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:41:13,083][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:41:13,406][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:41:13,730][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:41:14,053][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:41:14,376][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:41:15,087][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:41:15,820][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:41:15,822][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:41:15,824][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:41:16,873][__main__][INFO] - Iteration 108 took 20s (32.18% Gen, 62.77% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 43m 11s. Estimated total time: 17h 20m 39s. Time estimates for 10 more iterations: 3m 28s, 100 more iterations: 34m 41s, 500 more iterations: 2h 53m 26s.
[2025-11-13 08:41:16,875][__main__][INFO] - Starting iteration 108.
[2025-11-13 08:41:16,878][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1.
[2025-11-13 08:41:16,879][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:41:23,633][__main__][INFO] - Number of regex retries in iteration 108: 0
[2025-11-13 08:41:23,634][__main__][INFO] - agents played in iteration 108 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:41:24,100][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:41:24,134][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:41:24,167][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:41:24,199][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:41:24,200][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:41:24,201][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:41:24,919][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:41:25,213][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:41:25,540][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:41:25,865][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:41:26,188][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:41:26,512][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:41:26,835][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:41:27,159][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:41:27,486][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:41:27,808][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:41:28,135][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:41:28,458][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:41:28,781][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:41:29,106][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:41:29,429][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:41:29,750][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:41:30,073][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:41:30,395][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:41:30,719][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:41:31,044][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:41:31,366][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:41:31,688][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:41:32,011][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:41:32,333][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:41:32,655][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:41:32,978][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:41:33,300][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:41:33,624][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:41:33,949][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:41:34,270][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:41:34,594][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:41:34,916][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:41:35,240][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:41:35,971][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:41:36,726][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:41:36,728][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:41:36,729][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:41:37,731][__main__][INFO] - Iteration 109 took 20s (32.39% Gen, 62.80% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 44m 52s. Estimated total time: 17h 22m 40s. Time estimates for 10 more iterations: 3m 28s, 100 more iterations: 34m 45s, 500 more iterations: 2h 53m 46s.
[2025-11-13 08:41:37,733][__main__][INFO] - Starting iteration 109.
[2025-11-13 08:41:37,736][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1.
[2025-11-13 08:41:37,736][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:41:44,535][__main__][INFO] - Number of regex retries in iteration 109: 0
[2025-11-13 08:41:44,536][__main__][INFO] - agents played in iteration 109 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:41:44,998][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:41:45,030][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:41:45,064][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:41:45,097][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:41:45,097][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:41:45,098][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:41:45,812][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:41:46,106][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:41:46,430][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:41:46,753][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:41:47,077][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:41:47,400][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:41:47,725][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:41:48,050][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:41:48,374][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:41:48,695][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:41:49,019][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:41:49,343][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:41:49,668][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:41:49,991][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:41:50,315][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:41:50,642][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:41:50,964][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:41:51,286][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:41:51,609][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:41:51,931][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:41:52,255][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:41:52,577][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:41:52,901][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:41:53,225][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:41:53,547][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:41:53,871][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:41:54,194][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:41:54,522][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:41:54,845][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:41:55,166][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:41:55,490][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:41:55,816][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:41:56,139][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:41:56,845][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:41:57,592][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:41:57,594][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:41:57,596][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:41:58,572][__main__][INFO] - Iteration 110 took 20s (32.63% Gen, 62.68% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 43m 40s. Estimated total time: 17h 21m 49s. Time estimates for 10 more iterations: 3m 28s, 100 more iterations: 34m 43s, 500 more iterations: 2h 53m 38s. [2025-11-13 08:41:58,574][__main__][INFO] - Starting iteration 110. [2025-11-13 08:41:58,577][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2025-11-13 08:41:58,578][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:42:05,328][__main__][INFO] - Number of regex retries in iteration 110: 0
[2025-11-13 08:42:05,328][__main__][INFO] - agents played in iteration 110 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:42:05,789][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:42:05,822][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:42:05,855][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:42:05,888][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:42:05,888][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:42:05,889][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:42:06,607][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:42:06,901][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:42:07,226][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:42:07,549][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:42:07,875][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:42:08,198][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:42:08,522][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:42:08,846][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:42:09,168][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:42:09,492][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:42:09,815][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:42:10,139][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:42:10,461][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:42:10,784][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:42:11,108][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:42:11,431][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:42:11,755][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:42:12,078][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:42:12,402][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:42:12,726][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:42:13,048][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:42:13,370][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:42:13,693][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:42:14,018][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:42:14,340][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:42:14,663][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:42:14,988][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:42:15,314][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:42:15,637][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:42:15,961][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:42:16,283][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:42:16,607][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:42:16,935][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:42:17,654][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:42:18,409][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:42:18,411][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:42:18,412][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:42:20,486][__main__][INFO] - Iteration 111 took 21s (30.81% Gen, 59.72% Train). Generation: 6s, Training: 13s. Estimated remaining time: 17h 36m 58s. Estimated total time: 18h 15m 30s. Time estimates for 10 more iterations: 3m 39s, 100 more iterations: 36m 31s, 500 more iterations: 3h 2m 35s.
[2025-11-13 08:42:20,488][__main__][INFO] - Starting iteration 111.
[2025-11-13 08:42:20,491][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1.
[2025-11-13 08:42:20,492][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:42:27,569][__main__][INFO] - Number of regex retries in iteration 111: 0
[2025-11-13 08:42:27,569][__main__][INFO] - agents played in iteration 111 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:42:28,027][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:42:28,063][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:42:28,096][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:42:28,129][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:42:28,130][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:42:28,130][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:42:28,848][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:42:29,143][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:42:29,466][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:42:29,788][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:42:30,111][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:42:30,435][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:42:30,758][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:42:31,082][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:42:31,406][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:42:31,729][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:42:32,052][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:42:32,374][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:42:32,698][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:42:33,025][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:42:33,350][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:42:33,671][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:42:33,994][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:42:34,318][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:42:34,640][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:42:34,962][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:42:35,286][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:42:35,609][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:42:35,932][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:42:36,254][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:42:36,579][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:42:36,901][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:42:37,224][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:42:37,547][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:42:37,868][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:42:38,192][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:42:38,514][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:42:38,838][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:42:39,161][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:42:39,874][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:42:40,613][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:42:40,615][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:42:40,617][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:42:41,599][__main__][INFO] - Iteration 112 took 21s (33.53% Gen, 61.81% Train). Generation: 7s, Training: 13s. Estimated remaining time: 16h 56m 33s. Estimated total time: 17h 35m 25s. Time estimates for 10 more iterations: 3m 31s, 100 more iterations: 35m 10s, 500 more iterations: 2h 55m 54s.
[2025-11-13 08:42:41,601][__main__][INFO] - Starting iteration 112.
[2025-11-13 08:42:41,605][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1.
[2025-11-13 08:42:41,605][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:42:48,548][__main__][INFO] - Number of regex retries in iteration 112: 0
[2025-11-13 08:42:48,548][__main__][INFO] - agents played in iteration 112 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:42:49,015][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:42:49,049][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:42:49,082][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:42:49,116][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:42:49,117][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:42:49,117][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:42:49,884][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:42:50,177][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:42:50,502][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:42:50,826][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:42:51,150][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:42:51,475][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:42:51,802][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:42:52,126][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:42:52,455][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:42:52,785][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:42:53,112][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:42:53,438][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:42:53,764][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:42:54,089][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:42:54,413][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:42:54,743][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:42:55,065][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:42:55,388][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:42:55,712][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:42:56,038][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:42:56,366][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:42:56,695][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:42:57,022][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:42:57,349][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:42:57,676][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:42:58,004][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:42:58,328][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:42:58,652][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:42:58,975][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:42:59,299][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:42:59,624][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:42:59,946][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:43:00,273][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:43:01,016][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:43:01,746][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:43:01,748][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:43:01,750][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:43:02,883][__main__][INFO] - Iteration 113 took 21s (32.63% Gen, 62.04% Train). Generation: 6s, Training: 13s. Estimated remaining time: 17h 4m 42s. Estimated total time: 17h 43m 56s. Time estimates for 10 more iterations: 3m 32s, 100 more iterations: 35m 27s, 500 more iterations: 2h 57m 19s.
[2025-11-13 08:43:02,885][__main__][INFO] - Starting iteration 113.
[2025-11-13 08:43:02,887][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1.
[2025-11-13 08:43:02,888][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:43:09,770][__main__][INFO] - Number of regex retries in iteration 113: 0
[2025-11-13 08:43:09,771][__main__][INFO] - agents played in iteration 113 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:43:10,231][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:43:10,264][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:43:10,297][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:43:10,330][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:43:10,331][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:43:10,331][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:43:11,041][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:43:11,334][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:43:11,658][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:43:11,983][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:43:12,307][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:43:12,630][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:43:12,953][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:43:13,275][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:43:13,598][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:43:13,922][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:43:14,250][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:43:14,574][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:43:14,900][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:43:15,223][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:43:15,546][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:43:15,869][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:43:16,195][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:43:16,518][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:43:16,841][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:43:17,164][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:43:17,489][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:43:17,812][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:43:18,136][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:43:18,460][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:43:18,785][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:43:19,114][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:43:19,436][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:43:19,760][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:43:20,082][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:43:20,406][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:43:20,729][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:43:21,052][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:43:21,377][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:43:22,111][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:43:22,855][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:43:22,856][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:43:22,858][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:43:23,845][__main__][INFO] - Iteration 114 took 20s (32.84% Gen, 62.44% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 48m 20s. Estimated total time: 17h 27m 54s. Time estimates for 10 more iterations: 3m 29s, 100 more iterations: 34m 55s, 500 more iterations: 2h 54m 39s.
[2025-11-13 08:43:23,847][__main__][INFO] - Starting iteration 114.
[2025-11-13 08:43:23,850][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1.
[2025-11-13 08:43:23,851][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:43:30,855][__main__][INFO] - Number of regex retries in iteration 114: 0
[2025-11-13 08:43:30,855][__main__][INFO] - agents played in iteration 114 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:43:31,313][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:43:31,348][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:43:31,381][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:43:31,414][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:43:31,415][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:43:31,415][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:43:32,136][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:43:32,430][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:43:32,755][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:43:33,078][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:43:33,401][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:43:33,723][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:43:34,045][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:43:34,370][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:43:34,694][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:43:35,017][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:43:35,342][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:43:35,666][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:43:35,990][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:43:36,315][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:43:36,639][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:43:36,962][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:43:37,284][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:43:37,608][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:43:37,930][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:43:38,253][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:43:38,577][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:43:38,901][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:43:39,225][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:43:39,548][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:43:39,871][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:43:40,195][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:43:40,517][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:43:40,839][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:43:41,163][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:43:41,485][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:43:41,808][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:43:42,130][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:43:42,455][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:43:43,166][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:43:43,895][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:43:43,897][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:43:43,898][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:43:44,893][__main__][INFO] - Iteration 115 took 21s (33.28% Gen, 61.98% Train). Generation: 7s, Training: 13s. Estimated remaining time: 16h 52m 17s. Estimated total time: 17h 32m 12s. Time estimates for 10 more iterations: 3m 30s, 100 more iterations: 35m 4s, 500 more iterations: 2h 55m 22s.
[2025-11-13 08:43:44,896][__main__][INFO] - Starting iteration 115.
[2025-11-13 08:43:44,898][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1.
[2025-11-13 08:43:44,899][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:43:51,793][__main__][INFO] - Number of regex retries in iteration 115: 0 [2025-11-13 08:43:51,794][__main__][INFO] - agents played in iteration 115 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 08:43:52,257][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:43:52,292][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:43:52,326][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:43:52,359][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:43:52,360][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:43:52,361][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
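# Note on the "Estimated remaining time" figures: the per-iteration summaries in this log (e.g. "Iteration 115 took 21s ... 10 more iterations: 3m 30s, 500 more iterations: 2h 55m 22s") are consistent with plain linear extrapolation from the average iteration time. A minimal sketch of that arithmetic; the function names are hypothetical and this is not the actual mllm code, which is not shown in the log:

```python
def fmt_duration(seconds):
    """Format a duration in seconds as e.g. '3m 30s' or '2h 55m 22s',
    matching the style of the estimates printed in the log."""
    s = int(seconds)
    h, rem = divmod(s, 3600)
    m, sec = divmod(rem, 60)
    parts = []
    if h:
        parts.append(f"{h}h")
    if h or m:
        parts.append(f"{m}m")
    parts.append(f"{sec}s")
    return " ".join(parts)

def eta(avg_iteration_seconds, iterations_left):
    """Linear extrapolation: remaining time = average iteration time x iterations left."""
    return fmt_duration(avg_iteration_seconds * iterations_left)
```

With an average of roughly 21.04 s per iteration, `eta(21.04, 10)` gives "3m 30s" and `eta(21.04, 100)` gives "35m 4s", matching the logged estimates to within rounding.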
[2025-11-13 08:43:53,094][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:43:53,388][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:43:53,714][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:43:54,039][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:43:54,363][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:43:54,688][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:43:55,013][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:43:55,340][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:43:55,662][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:43:55,988][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:43:56,316][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:43:56,640][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:43:56,964][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:43:57,288][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:43:57,612][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:43:57,937][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:43:58,262][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:43:58,586][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:43:58,910][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:43:59,236][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:43:59,559][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:43:59,884][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:44:00,213][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:44:00,540][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:44:00,867][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:44:01,193][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:44:01,517][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:44:01,841][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:44:02,164][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:44:02,488][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:44:02,813][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:44:03,139][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:44:03,468][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:44:04,204][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:44:04,933][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:44:04,934][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:44:04,936][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:44:05,910][__main__][INFO] - Iteration 116 took 21s (32.81% Gen, 62.55% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 50m 21s. Estimated total time: 17h 30m 37s. Time estimates for 10 more iterations: 3m 30s, 100 more iterations: 35m 1s, 500 more iterations: 2h 55m 6s.
[2025-11-13 08:44:05,912][__main__][INFO] - Starting iteration 116.
[2025-11-13 08:44:05,915][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1.
[2025-11-13 08:44:05,916][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:44:12,612][__main__][INFO] - Number of regex retries in iteration 116: 0
[2025-11-13 08:44:12,613][__main__][INFO] - agents played in iteration 116 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:44:13,077][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:44:13,115][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:44:13,149][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:44:13,183][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:44:13,184][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:44:13,184][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:44:13,903][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:44:14,198][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:44:14,521][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:44:14,845][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:44:15,174][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:44:15,498][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:44:15,820][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:44:16,142][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:44:16,465][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:44:16,788][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:44:17,112][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:44:17,436][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:44:17,759][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:44:18,082][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:44:18,404][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:44:18,730][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:44:19,052][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:44:19,375][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:44:19,698][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:44:20,024][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:44:20,349][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:44:20,671][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:44:20,995][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:44:21,319][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:44:21,643][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:44:21,970][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:44:22,291][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:44:22,614][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:44:22,937][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:44:23,261][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:44:23,583][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:44:23,906][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:44:24,233][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:44:24,925][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:44:25,649][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:44:25,650][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:44:25,652][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:44:26,690][__main__][INFO] - Iteration 117 took 20s (32.24% Gen, 62.77% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 38m 8s. Estimated total time: 17h 18m 45s. Time estimates for 10 more iterations: 3m 27s, 100 more iterations: 34m 37s, 500 more iterations: 2h 53m 7s.
[2025-11-13 08:44:26,692][__main__][INFO] - Starting iteration 117.
[2025-11-13 08:44:26,695][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1.
[2025-11-13 08:44:26,695][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:44:33,722][__main__][INFO] - Number of regex retries in iteration 117: 0
[2025-11-13 08:44:33,722][__main__][INFO] - agents played in iteration 117 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:44:34,189][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:44:34,224][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:44:34,257][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:44:34,291][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:44:34,292][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:44:34,292][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:44:35,015][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:44:35,310][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:44:35,634][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:44:35,961][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:44:36,289][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:44:36,618][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:44:36,942][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:44:37,265][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:44:37,589][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:44:37,911][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:44:38,235][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:44:38,559][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:44:38,888][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:44:39,212][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:44:39,542][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:44:39,868][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:44:40,195][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:44:40,524][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:44:40,850][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:44:41,178][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:44:41,507][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:44:41,831][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:44:42,155][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:44:42,480][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:44:42,809][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:44:43,134][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:44:43,459][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:44:43,786][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:44:44,111][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:44:44,434][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:44:44,759][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:44:45,083][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:44:45,410][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:44:46,123][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:44:46,865][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:44:46,867][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:44:46,869][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:44:47,928][__main__][INFO] - Iteration 118 took 21s (33.09% Gen, 61.91% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 0m 44s. Estimated total time: 17h 41m 43s. Time estimates for 10 more iterations: 3m 32s, 100 more iterations: 35m 23s, 500 more iterations: 2h 56m 57s.
[2025-11-13 08:44:47,930][__main__][INFO] - Starting iteration 118.
[2025-11-13 08:44:47,933][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1.
[2025-11-13 08:44:47,934][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:44:54,845][__main__][INFO] - Number of regex retries in iteration 118: 0
[2025-11-13 08:44:54,845][__main__][INFO] - agents played in iteration 118 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:44:55,310][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:44:55,344][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:44:55,378][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:44:55,411][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:44:55,412][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:44:55,412][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:44:56,136][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:44:56,431][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:44:56,755][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:44:57,078][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:44:57,401][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:44:57,724][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:44:58,049][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:44:58,372][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:44:58,694][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:44:59,017][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:44:59,342][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:44:59,665][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:44:59,989][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:45:00,312][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:45:00,634][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:45:00,959][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:45:01,282][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:45:01,606][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:45:01,929][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:45:02,253][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:45:02,576][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:45:02,899][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:45:03,224][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:45:03,547][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:45:03,869][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:45:04,193][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:45:04,516][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:45:04,841][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:45:05,163][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:45:05,486][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:45:05,811][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:45:06,135][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:45:06,460][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:45:07,171][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:45:07,908][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:45:07,910][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:45:07,911][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:45:08,873][__main__][INFO] - Iteration 119 took 20s (33.01% Gen, 62.40% Train). Generation: 6s, Training: 13s. Estimated remaining time: 16h 45m 41s. Estimated total time: 17h 27m 1s. Time estimates for 10 more iterations: 3m 29s, 100 more iterations: 34m 54s, 500 more iterations: 2h 54m 30s.
[2025-11-13 08:45:08,875][__main__][INFO] - Starting iteration 119.
[2025-11-13 08:45:08,878][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1.
[2025-11-13 08:45:08,878][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:45:15,892][__main__][INFO] - Number of regex retries in iteration 119: 0
[2025-11-13 08:45:15,892][__main__][INFO] - agents played in iteration 119 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:45:16,355][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:45:16,388][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:45:16,421][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:45:16,454][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:45:16,455][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:45:16,455][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:45:17,183][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:45:17,478][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:45:17,804][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:45:18,127][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:45:18,449][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:45:18,772][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:45:19,095][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:45:19,420][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:45:19,744][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:45:20,069][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:45:20,392][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:45:20,716][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:45:21,038][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:45:21,362][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:45:21,687][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:45:22,014][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:45:22,343][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:45:22,669][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:45:22,997][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:45:23,323][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:45:23,649][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:45:23,975][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:45:24,302][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:45:24,628][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:45:24,953][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:45:25,276][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:45:25,601][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:45:25,924][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:45:26,247][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:45:26,573][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:45:26,899][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:45:27,223][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:45:27,548][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:45:28,261][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:45:29,006][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:45:29,007][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:45:29,009][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:45:29,994][__main__][INFO] - Iteration 120 took 21s (33.21% Gen, 62.11% Train). Generation: 7s, Training: 13s. Estimated remaining time: 16h 54m 11s. Estimated total time: 17h 35m 51s. Time estimates for 10 more iterations: 3m 31s, 100 more iterations: 35m 11s, 500 more iterations: 2h 55m 58s.
[2025-11-13 08:45:29,997][__main__][INFO] - Starting iteration 120.
[2025-11-13 08:45:30,000][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1.
[2025-11-13 08:45:30,000][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:45:36,983][__main__][INFO] - Number of regex retries in iteration 120: 0
[2025-11-13 08:45:36,983][__main__][INFO] - agents played in iteration 120 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:45:37,440][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:45:37,472][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:45:37,505][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:45:37,537][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:45:37,538][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:45:37,539][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:45:38,254][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:45:38,548][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:45:38,874][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:45:39,196][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:45:39,520][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:45:39,844][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:45:40,166][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:45:40,490][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:45:40,812][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:45:41,134][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:45:41,458][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:45:41,782][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:45:42,107][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:45:42,431][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:45:42,754][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:45:43,078][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:45:43,402][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:45:43,726][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:45:44,048][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:45:44,373][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:45:44,697][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:45:45,023][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:45:45,346][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:45:45,669][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:45:45,995][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:45:46,320][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:45:46,644][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:45:46,969][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:45:47,295][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:45:47,619][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:45:47,945][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:45:48,271][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:45:48,600][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:45:49,322][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:45:50,067][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:45:50,068][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:45:50,070][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:45:52,112][__main__][INFO] - Iteration 121 took 22s (31.58% Gen, 59.18% Train). Generation: 6s, Training: 13s. Estimated remaining time: 17h 43m 37s. Estimated total time: 18h 25m 40s. Time estimates for 10 more iterations: 3m 41s, 100 more iterations: 36m 51s, 500 more iterations: 3h 4m 16s.
[2025-11-13 08:45:52,115][__main__][INFO] - Starting iteration 121.
[2025-11-13 08:45:52,117][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1.
[2025-11-13 08:45:52,118][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:45:59,488][__main__][INFO] - Number of regex retries in iteration 121: 0
[2025-11-13 08:45:59,489][__main__][INFO] - agents played in iteration 121 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:45:59,956][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:45:59,991][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:46:00,024][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:46:00,057][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:46:00,057][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:46:00,058][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:46:00,785][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:46:01,080][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:46:01,406][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:46:01,729][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:46:02,053][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:46:02,376][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:46:02,701][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:46:03,023][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:46:03,347][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:46:03,671][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:46:03,994][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:46:04,317][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:46:04,640][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:46:04,963][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:46:05,287][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:46:05,612][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:46:05,934][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:46:06,258][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:46:06,586][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:46:06,911][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:46:07,234][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:46:07,559][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:46:07,884][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:46:08,210][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:46:08,534][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:46:08,858][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:46:09,182][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:46:09,507][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:46:09,833][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:46:10,156][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:46:10,481][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:46:10,810][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:46:11,134][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:46:11,850][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:46:12,606][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:46:12,608][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:46:12,610][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:46:13,619][__main__][INFO] - Iteration 122 took 21s (34.28% Gen, 61.02% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 12m 41s. Estimated total time: 17h 55m 5s. Time estimates for 10 more iterations: 3m 35s, 100 more iterations: 35m 50s, 500 more iterations: 2h 59m 10s.
[2025-11-13 08:46:13,621][__main__][INFO] - Starting iteration 122.
[2025-11-13 08:46:13,624][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1.
[2025-11-13 08:46:13,624][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:46:20,872][__main__][INFO] - Number of regex retries in iteration 122: 0
[2025-11-13 08:46:20,873][__main__][INFO] - agents played in iteration 122 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:46:21,338][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:46:21,371][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:46:21,404][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:46:21,438][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:46:21,439][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:46:21,439][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:46:22,161][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:46:22,456][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:46:22,781][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:46:23,104][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:46:23,428][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:46:23,751][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:46:24,073][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:46:24,397][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:46:24,722][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:46:25,046][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:46:25,371][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:46:25,695][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:46:26,018][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:46:26,345][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:46:26,669][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:46:26,994][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:46:27,316][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:46:27,640][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:46:27,962][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:46:28,286][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:46:28,610][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:46:28,935][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:46:29,259][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:46:29,583][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:46:29,906][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:46:30,228][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:46:30,551][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:46:30,874][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:46:31,197][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:46:31,522][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:46:31,844][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:46:32,167][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:46:32,492][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:46:33,204][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:46:33,941][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:46:33,943][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:46:33,945][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:46:34,918][__main__][INFO] - Iteration 123 took 21s (34.04% Gen, 61.38% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 1m 59s. Estimated total time: 17h 44m 45s. Time estimates for 10 more iterations: 3m 32s, 100 more iterations: 35m 29s, 500 more iterations: 2h 57m 27s.
[2025-11-13 08:46:34,920][__main__][INFO] - Starting iteration 123.
[2025-11-13 08:46:34,923][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1.
[2025-11-13 08:46:34,923][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:46:42,315][__main__][INFO] - Number of regex retries in iteration 123: 0
[2025-11-13 08:46:42,315][__main__][INFO] - agents played in iteration 123 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:46:42,793][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:46:42,829][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:46:42,862][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:46:42,895][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:46:42,895][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:46:42,896][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:46:43,612][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:46:43,907][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:46:44,234][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:46:44,557][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:46:44,879][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:46:45,205][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:46:45,528][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:46:45,852][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:46:46,178][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:46:46,504][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:46:46,828][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:46:47,152][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:46:47,475][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:46:47,797][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:46:48,123][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:46:48,445][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:46:48,770][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:46:49,095][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:46:49,418][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:46:49,740][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:46:50,065][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:46:50,388][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:46:50,710][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:46:51,034][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:46:51,359][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:46:51,682][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:46:52,009][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:46:52,333][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:46:52,656][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:46:52,979][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:46:53,303][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:46:53,628][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:46:53,953][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:46:54,667][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:46:55,409][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:46:55,411][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:46:55,412][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:46:56,516][__main__][INFO] - Iteration 124 took 21s (34.23% Gen, 60.65% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 16m 33s. Estimated total time: 17h 59m 41s. Time estimates for 10 more iterations: 3m 35s, 100 more iterations: 35m 59s, 500 more iterations: 2h 59m 56s.
[2025-11-13 08:46:56,519][__main__][INFO] - Starting iteration 124.
[2025-11-13 08:46:56,522][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1.
[2025-11-13 08:46:56,523][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:47:03,772][__main__][INFO] - Number of regex retries in iteration 124: 0
[2025-11-13 08:47:03,773][__main__][INFO] - agents played in iteration 124 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:47:04,230][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:47:04,264][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:47:04,296][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:47:04,329][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:47:04,330][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:47:04,330][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:47:05,059][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:47:05,353][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:47:05,677][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:47:06,000][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:47:06,324][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:47:06,649][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:47:06,971][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:47:07,294][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:47:07,618][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:47:07,941][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:47:08,264][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:47:08,588][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:47:08,911][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:47:09,236][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:47:09,562][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:47:09,887][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:47:10,211][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:47:10,536][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:47:10,860][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:47:11,184][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:47:11,508][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:47:11,838][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:47:12,165][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:47:12,489][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:47:12,814][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:47:13,137][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:47:13,462][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:47:13,786][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:47:14,111][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:47:14,433][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:47:14,757][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:47:15,081][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:47:15,406][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:47:16,124][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:47:16,868][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:47:16,870][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:47:16,872][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:47:17,868][__main__][INFO] - Iteration 125 took 21s (33.96% Gen, 61.36% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 3m 51s. Estimated total time: 17h 47m 20s. Time estimates for 10 more iterations: 3m 33s, 100 more iterations: 35m 34s, 500 more iterations: 2h 57m 53s.
[2025-11-13 08:47:17,870][__main__][INFO] - Starting iteration 125.
[2025-11-13 08:47:17,874][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1.
[2025-11-13 08:47:17,874][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:47:25,095][__main__][INFO] - Number of regex retries in iteration 125: 0
[2025-11-13 08:47:25,096][__main__][INFO] - agents played in iteration 125 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:47:25,559][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:47:25,594][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:47:25,628][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:47:25,661][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:47:25,662][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:47:25,662][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:47:26,379][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:47:26,672][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:47:26,996][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:47:27,319][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:47:27,642][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:47:27,968][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:47:28,291][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:47:28,613][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:47:28,935][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:47:29,257][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:47:29,581][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:47:29,903][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:47:30,228][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:47:30,550][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:47:30,875][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:47:31,198][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:47:31,523][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:47:31,847][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:47:32,170][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:47:32,493][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:47:32,817][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:47:33,141][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:47:33,466][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:47:33,790][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:47:34,112][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:47:34,438][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:47:34,763][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:47:35,087][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:47:35,410][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:47:35,732][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:47:36,056][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:47:36,379][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:47:36,701][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:47:37,410][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:47:38,138][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:47:38,140][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:47:38,141][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:47:39,153][__main__][INFO] - Iteration 126 took 21s (33.94% Gen, 61.30% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 0m 10s. Estimated total time: 17h 44m 0s. Time estimates for 10 more iterations: 3m 32s, 100 more iterations: 35m 28s, 500 more iterations: 2h 57m 20s.
[2025-11-13 08:47:39,155][__main__][INFO] - Starting iteration 126.
[2025-11-13 08:47:39,158][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1.
[2025-11-13 08:47:39,159][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:47:46,301][__main__][INFO] - Number of regex retries in iteration 126: 0
[2025-11-13 08:47:46,302][__main__][INFO] - agents played in iteration 126 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:47:46,759][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:47:46,795][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:47:46,829][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:47:46,862][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:47:46,863][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:47:46,863][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:47:47,590][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:47:47,885][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:47:48,208][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:47:48,532][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:47:48,854][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:47:49,178][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:47:49,502][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:47:49,825][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:47:50,148][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:47:50,470][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:47:50,794][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:47:51,117][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:47:51,443][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:47:51,768][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:47:52,091][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:47:52,415][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:47:52,737][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:47:53,060][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:47:53,383][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:47:53,708][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:47:54,032][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:47:54,355][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:47:54,679][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:47:55,001][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:47:55,324][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:47:55,649][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:47:55,971][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:47:56,294][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:47:56,619][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:47:56,944][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:47:57,267][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:47:57,589][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:47:57,912][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:47:58,633][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:47:59,389][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:47:59,390][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:47:59,392][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:48:00,387][__main__][INFO] - Iteration 127 took 21s (33.65% Gen, 61.66% Train). Generation: 7s, Training: 13s. Estimated remaining time: 16h 57m 16s. Estimated total time: 17h 41m 27s. Time estimates for 10 more iterations: 3m 32s, 100 more iterations: 35m 22s, 500 more iterations: 2h 56m 54s.
[2025-11-13 08:48:00,389][__main__][INFO] - Starting iteration 127.
[2025-11-13 08:48:00,392][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1.
[2025-11-13 08:48:00,392][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:48:07,569][__main__][INFO] - Number of regex retries in iteration 127: 0
[2025-11-13 08:48:07,570][__main__][INFO] - agents played in iteration 127 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:48:08,038][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:48:08,072][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:48:08,105][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:48:08,138][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:48:08,139][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:48:08,139][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:48:08,861][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:48:09,157][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:48:09,482][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:48:09,807][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:48:10,129][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:48:10,454][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:48:10,778][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:48:11,104][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:48:11,428][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:48:11,750][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:48:12,075][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:48:12,400][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:48:12,723][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:48:13,045][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:48:13,371][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:48:13,695][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:48:14,021][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:48:14,345][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:48:14,670][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:48:14,994][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:48:15,318][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:48:15,644][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:48:15,966][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:48:16,289][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:48:16,611][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:48:16,935][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:48:17,261][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:48:17,585][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:48:17,912][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:48:18,239][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:48:18,564][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:48:18,888][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:48:19,215][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:48:19,930][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:48:20,675][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:48:20,677][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:48:20,679][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:48:21,656][__main__][INFO] - Iteration 128 took 21s (33.75% Gen, 61.65% Train). Generation: 7s, Training: 13s. Estimated remaining time: 16h 58m 43s. Estimated total time: 17h 43m 15s. Time estimates for 10 more iterations: 3m 32s, 100 more iterations: 35m 26s, 500 more iterations: 2h 57m 12s.
[2025-11-13 08:48:21,658][__main__][INFO] - Starting iteration 128.
[2025-11-13 08:48:21,662][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1.
[2025-11-13 08:48:21,662][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:48:28,983][__main__][INFO] - Number of regex retries in iteration 128: 0
[2025-11-13 08:48:28,984][__main__][INFO] - agents played in iteration 128 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:48:29,448][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:48:29,481][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:48:29,515][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:48:29,548][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:48:29,549][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:48:29,549][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:48:30,257][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:48:30,552][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:48:30,874][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:48:31,197][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:48:31,519][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:48:31,844][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:48:32,167][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:48:32,490][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:48:32,814][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:48:33,136][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:48:33,459][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:48:33,780][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:48:34,104][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:48:34,431][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:48:34,755][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:48:35,077][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:48:35,399][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:48:35,721][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:48:36,045][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:48:36,369][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:48:36,695][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:48:37,019][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:48:37,343][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:48:37,667][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:48:37,990][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:48:38,313][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:48:38,635][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:48:38,960][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:48:39,283][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:48:39,608][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:48:39,931][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:48:40,253][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:48:40,575][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:48:41,275][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:48:41,999][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:48:42,000][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:48:42,001][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:48:43,030][__main__][INFO] - Iteration 129 took 21s (34.26% Gen, 60.92% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 3m 33s. Estimated total time: 17h 48m 26s. Time estimates for 10 more iterations: 3m 33s, 100 more iterations: 35m 36s, 500 more iterations: 2h 58m 4s.
[2025-11-13 08:48:43,032][__main__][INFO] - Starting iteration 129.
[2025-11-13 08:48:43,035][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1.
[2025-11-13 08:48:43,036][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:48:50,206][__main__][INFO] - Number of regex retries in iteration 129: 0
[2025-11-13 08:48:50,207][__main__][INFO] - agents played in iteration 129 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:48:50,666][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:48:50,699][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:48:50,733][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:48:50,767][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:48:50,767][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:48:50,767][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:48:51,488][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:48:51,783][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:48:52,107][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:48:52,429][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:48:52,753][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:48:53,077][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:48:53,401][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:48:53,724][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:48:54,046][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:48:54,367][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:48:54,691][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:48:55,016][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:48:55,338][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:48:55,660][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:48:55,984][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:48:56,307][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:48:56,631][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:48:56,954][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:48:57,278][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:48:57,603][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:48:57,927][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:48:58,252][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:48:58,577][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:48:58,900][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:48:59,223][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:48:59,550][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:48:59,875][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:49:00,199][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:49:00,522][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:49:00,845][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:49:01,173][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:49:01,501][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:49:01,829][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:49:02,545][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:49:03,268][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:49:03,269][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:49:03,271][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:49:04,275][__main__][INFO] - Iteration 130 took 21s (33.76% Gen, 61.51% Train). Generation: 7s, Training: 13s. Estimated remaining time: 16h 56m 46s. Estimated total time: 17h 42m 1s. Time estimates for 10 more iterations: 3m 32s, 100 more iterations: 35m 24s, 500 more iterations: 2h 57m 0s.
[2025-11-13 08:49:04,277][__main__][INFO] - Starting iteration 130.
[2025-11-13 08:49:04,281][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1.
[2025-11-13 08:49:04,281][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:49:11,591][__main__][INFO] - Number of regex retries in iteration 130: 0
[2025-11-13 08:49:11,592][__main__][INFO] - agents played in iteration 130 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:49:12,061][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:49:12,094][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:49:12,128][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:49:12,161][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:49:12,162][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:49:12,162][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:49:12,888][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:49:13,184][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:49:13,509][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:49:13,838][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:49:14,161][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:49:14,487][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:49:14,810][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:49:15,133][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:49:15,455][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:49:15,777][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:49:16,102][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:49:16,426][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:49:16,750][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:49:17,073][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:49:17,395][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:49:17,720][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:49:18,046][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:49:18,371][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:49:18,699][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:49:19,022][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:49:19,346][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:49:19,669][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:49:19,992][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:49:20,316][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:49:20,639][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:49:20,963][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:49:21,286][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:49:21,610][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:49:21,934][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:49:22,257][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:49:22,581][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:49:22,903][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:49:23,227][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:49:23,945][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:49:24,689][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:49:24,690][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:49:24,692][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:49:26,606][__main__][INFO] - Iteration 131 took 22s (32.75% Gen, 58.68% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 50m 40s. Estimated total time: 18h 36m 17s. Time estimates for 10 more iterations: 3m 43s, 100 more iterations: 37m 12s, 500 more iterations: 3h 6m 2s.
[2025-11-13 08:49:26,608][__main__][INFO] - Starting iteration 131.
[2025-11-13 08:49:26,611][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1.
[2025-11-13 08:49:26,612][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:49:34,644][__main__][INFO] - Number of regex retries in iteration 131: 0
[2025-11-13 08:49:34,645][__main__][INFO] - agents played in iteration 131 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:49:35,118][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:49:35,154][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:49:35,189][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:49:35,222][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:49:35,223][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:49:35,223][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:49:35,958][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:49:36,253][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:49:36,576][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:49:36,901][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:49:37,227][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:49:37,550][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:49:37,875][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:49:38,201][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:49:38,526][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:49:38,848][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:49:39,171][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:49:39,495][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:49:39,819][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:49:40,142][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:49:40,466][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:49:40,789][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:49:41,114][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:49:41,436][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:49:41,763][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:49:42,089][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:49:42,413][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:49:42,736][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:49:43,059][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:49:43,384][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:49:43,708][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:49:44,032][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:49:44,356][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:49:44,680][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:49:45,004][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:49:45,328][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:49:45,652][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:49:45,976][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:49:46,301][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:49:47,038][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:49:47,791][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:49:47,792][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:49:47,794][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:49:48,795][__main__][INFO] - Iteration 132 took 22s (36.21% Gen, 59.27% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 43m 15s. Estimated total time: 18h 29m 15s. Time estimates for 10 more iterations: 3m 41s, 100 more iterations: 36m 58s, 500 more iterations: 3h 4m 52s.
[2025-11-13 08:49:48,798][__main__][INFO] - Starting iteration 132.
[2025-11-13 08:49:48,801][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1.
[2025-11-13 08:49:48,801][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:49:56,139][__main__][INFO] - Number of regex retries in iteration 132: 0
[2025-11-13 08:49:56,140][__main__][INFO] - agents played in iteration 132 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:49:56,604][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:49:56,638][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:49:56,671][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:49:56,705][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:49:56,706][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:49:56,706][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:49:57,416][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:49:57,711][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:49:58,035][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:49:58,359][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:49:58,686][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:49:59,013][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:49:59,337][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:49:59,661][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:49:59,988][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:50:00,311][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:50:00,634][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:50:00,956][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:50:01,283][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:50:01,606][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:50:01,934][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:50:02,259][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:50:02,582][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:50:02,909][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:50:03,230][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:50:03,553][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:50:03,882][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:50:04,205][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:50:04,528][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:50:04,853][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:50:05,177][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:50:05,501][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:50:05,830][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:50:06,153][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:50:06,476][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:50:06,805][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:50:07,128][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:50:07,450][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:50:07,779][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:50:08,481][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:50:09,206][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:50:09,208][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:50:09,210][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:50:10,195][__main__][INFO] - Iteration 133 took 21s (34.30% Gen, 61.09% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 3m 23s. Estimated total time: 17h 49m 44s. Time estimates for 10 more iterations: 3m 33s, 100 more iterations: 35m 39s, 500 more iterations: 2h 58m 17s.
[2025-11-13 08:50:10,197][__main__][INFO] - Starting iteration 133.
[2025-11-13 08:50:10,200][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1.
[2025-11-13 08:50:10,201][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:50:17,801][__main__][INFO] - Number of regex retries in iteration 133: 0
[2025-11-13 08:50:17,802][__main__][INFO] - agents played in iteration 133 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:50:18,266][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:50:18,300][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:50:18,333][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:50:18,367][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:50:18,368][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:50:18,368][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:50:19,090][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:50:19,384][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:50:19,707][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:50:20,031][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:50:20,353][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:50:20,676][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:50:21,000][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:50:21,325][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:50:21,647][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:50:21,972][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:50:22,296][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:50:22,619][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:50:22,944][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:50:23,267][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:50:23,589][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:50:23,912][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:50:24,236][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:50:24,558][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:50:24,883][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:50:25,204][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:50:25,527][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:50:25,850][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:50:26,173][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:50:26,498][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:50:26,824][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:50:27,148][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:50:27,472][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:50:27,793][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:50:28,116][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:50:28,438][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:50:28,763][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:50:29,088][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:50:29,415][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:50:30,116][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:50:30,851][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:50:30,852][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:50:30,854][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:50:31,902][__main__][INFO] - Iteration 134 took 21s (35.03% Gen, 60.14% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 18m 24s. Estimated total time: 18h 5m 7s. Time estimates for 10 more iterations: 3m 37s, 100 more iterations: 36m 10s, 500 more iterations: 3h 0m 51s.
[2025-11-13 08:50:31,904][__main__][INFO] - Starting iteration 134.
[2025-11-13 08:50:31,907][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1.
[2025-11-13 08:50:31,907][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:50:39,599][__main__][INFO] - Number of regex retries in iteration 134: 0
[2025-11-13 08:50:39,599][__main__][INFO] - agents played in iteration 134 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:50:40,073][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:50:40,106][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:50:40,140][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:50:40,174][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:50:40,174][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:50:40,175][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:50:40,887][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:50:41,182][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:50:41,506][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:50:41,829][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:50:42,154][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:50:42,483][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:50:42,811][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:50:43,136][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:50:43,460][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:50:43,784][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:50:44,108][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:50:44,430][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:50:44,754][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:50:45,077][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:50:45,402][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:50:45,726][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:50:46,049][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:50:46,377][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:50:46,702][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:50:47,030][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:50:47,354][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:50:47,678][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:50:48,003][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:50:48,331][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:50:48,656][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:50:48,978][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:50:49,303][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:50:49,628][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:50:49,953][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:50:50,281][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:50:50,609][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:50:50,933][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:50:51,261][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:50:51,980][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:50:52,715][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:50:52,717][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:50:52,718][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:50:53,712][__main__][INFO] - Iteration 135 took 21s (35.28% Gen, 60.16% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 23m 13s. Estimated total time: 18h 10m 18s. Time estimates for 10 more iterations: 3m 38s, 100 more iterations: 36m 20s, 500 more iterations: 3h 1m 43s.
[2025-11-13 08:50:53,714][__main__][INFO] - Starting iteration 135.
[2025-11-13 08:50:53,717][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1.
[2025-11-13 08:50:53,718][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:51:00,910][__main__][INFO] - Number of regex retries in iteration 135: 0
[2025-11-13 08:51:00,910][__main__][INFO] - agents played in iteration 135 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:51:01,383][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:51:01,417][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:51:01,451][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:51:01,486][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:51:01,486][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:51:01,487][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:51:02,212][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:51:02,507][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:51:02,829][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:51:03,154][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:51:03,477][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:51:03,799][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:51:04,124][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:51:04,447][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:51:04,770][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:51:05,092][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:51:05,416][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:51:05,739][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:51:06,065][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:51:06,390][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:51:06,715][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:51:07,040][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:51:07,366][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:51:07,690][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:51:08,014][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:51:08,341][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:51:08,664][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:51:08,986][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:51:09,310][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:51:09,634][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:51:09,959][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:51:10,282][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:51:10,605][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:51:10,930][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:51:11,254][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:51:11,577][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:51:11,903][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:51:12,227][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:51:12,556][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:51:13,278][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:51:14,013][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:51:14,015][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:51:14,016][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:51:15,097][__main__][INFO] - Iteration 136 took 21s (33.64% Gen, 61.30% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 1m 35s. Estimated total time: 17h 49m 1s. Time estimates for 10 more iterations: 3m 33s, 100 more iterations: 35m 38s, 500 more iterations: 2h 58m 10s.
[2025-11-13 08:51:15,099][__main__][INFO] - Starting iteration 136.
[2025-11-13 08:51:15,102][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1.
[2025-11-13 08:51:15,102][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:51:22,634][__main__][INFO] - Number of regex retries in iteration 136: 0
[2025-11-13 08:51:22,634][__main__][INFO] - agents played in iteration 136 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:51:23,101][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:51:23,137][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:51:23,171][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:51:23,205][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:51:23,205][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:51:23,206][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:51:23,926][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:51:24,220][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:51:24,545][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:51:24,873][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:51:25,201][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:51:25,525][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:51:25,853][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:51:26,177][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:51:26,503][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:51:26,828][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:51:27,153][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:51:27,479][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:51:27,803][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:51:28,131][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:51:28,458][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:51:28,782][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:51:29,105][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:51:29,428][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:51:29,752][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:51:30,075][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:51:30,398][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:51:30,722][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:51:31,046][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:51:31,373][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:51:31,699][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:51:32,023][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:51:32,348][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:51:32,671][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:51:32,994][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:51:33,316][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:51:33,640][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:51:33,966][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:51:34,292][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:51:35,004][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:51:35,739][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:51:35,740][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:51:35,742][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:51:36,766][__main__][INFO] - Iteration 137 took 21s (34.77% Gen, 60.50% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 15m 28s. Estimated total time: 18h 3m 16s. Time estimates for 10 more iterations: 3m 36s, 100 more iterations: 36m 6s, 500 more iterations: 3h 0m 32s.
[2025-11-13 08:51:36,768][__main__][INFO] - Starting iteration 137.
[2025-11-13 08:51:36,771][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1.
[2025-11-13 08:51:36,772][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:51:44,083][__main__][INFO] - Number of regex retries in iteration 137: 0
[2025-11-13 08:51:44,084][__main__][INFO] - agents played in iteration 137 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:51:44,545][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:51:44,581][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:51:44,615][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:51:44,648][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:51:44,649][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:51:44,649][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:51:45,379][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:51:45,675][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:51:45,999][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:51:46,325][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:51:46,651][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:51:46,979][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:51:47,303][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:51:47,632][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:51:47,960][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:51:48,288][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:51:48,611][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:51:48,935][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:51:49,264][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:51:49,590][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:51:49,917][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:51:50,243][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:51:50,567][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:51:50,890][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:51:51,216][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:51:51,539][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:51:51,867][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:51:52,191][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:51:52,515][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:51:52,838][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:51:53,161][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:51:53,486][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:51:53,812][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:51:54,136][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:51:54,462][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:51:54,789][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:51:55,116][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:51:55,441][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:51:55,763][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:51:56,478][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:51:57,229][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:51:57,231][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:51:57,232][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:51:58,454][__main__][INFO] - Iteration 138 took 21s (33.72% Gen, 60.64% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 16m 2s. Estimated total time: 18h 4m 11s. Time estimates for 10 more iterations: 3m 36s, 100 more iterations: 36m 8s, 500 more iterations: 3h 0m 41s.
[2025-11-13 08:51:58,457][__main__][INFO] - Starting iteration 138.
[2025-11-13 08:51:58,460][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1.
[2025-11-13 08:51:58,460][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:52:05,954][__main__][INFO] - Number of regex retries in iteration 138: 0
[2025-11-13 08:52:05,954][__main__][INFO] - agents played in iteration 138 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:52:06,413][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:52:06,448][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:52:06,482][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:52:06,516][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:52:06,517][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:52:06,517][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:52:07,237][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:52:07,531][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:52:07,854][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:52:08,177][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:52:08,501][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:52:08,826][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:52:09,149][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:52:09,471][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:52:09,796][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:52:10,119][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:52:10,443][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:52:10,767][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:52:11,092][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:52:11,415][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:52:11,740][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:52:12,064][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:52:12,388][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:52:12,713][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:52:13,037][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:52:13,362][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:52:13,685][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:52:14,008][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:52:14,331][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:52:14,656][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:52:14,980][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:52:15,304][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:52:15,627][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:52:15,952][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:52:16,280][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:52:16,606][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:52:16,929][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:52:17,253][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:52:17,580][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:52:18,294][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:52:19,029][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:52:19,031][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:52:19,032][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:52:20,008][__main__][INFO] - Iteration 139 took 21s (34.78% Gen, 60.69% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 8m 57s. Estimated total time: 17h 57m 28s. Time estimates for 10 more iterations: 3m 35s, 100 more iterations: 35m 54s, 500 more iterations: 2h 59m 34s.
[2025-11-13 08:52:20,011][__main__][INFO] - Starting iteration 139.
[2025-11-13 08:52:20,016][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1.
[2025-11-13 08:52:20,016][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:52:27,372][__main__][INFO] - Number of regex retries in iteration 139: 0
[2025-11-13 08:52:27,373][__main__][INFO] - agents played in iteration 139 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:52:27,836][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:52:27,870][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:52:27,904][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:52:27,938][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:52:27,938][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:52:27,939][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:52:28,682][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:52:29,066][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:52:29,397][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:52:29,725][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:52:30,052][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:52:30,374][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:52:30,699][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:52:31,028][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:52:31,353][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:52:31,683][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:52:32,007][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:52:32,333][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:52:32,656][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:52:32,980][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:52:33,306][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:52:33,636][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:52:33,964][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:52:34,287][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:52:34,618][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:52:34,941][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:52:35,267][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:52:35,596][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:52:35,929][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:52:36,257][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:52:36,579][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:52:36,907][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:52:37,236][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:52:37,565][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:52:37,895][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:52:38,219][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:52:38,545][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:52:38,868][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:52:39,194][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:52:39,922][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:52:40,670][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:52:40,671][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:52:40,673][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:52:41,762][__main__][INFO] - Iteration 140 took 21s (33.83% Gen, 61.16% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 18m 28s. Estimated total time: 18h 7m 20s. Time estimates for 10 more iterations: 3m 37s, 100 more iterations: 36m 14s, 500 more iterations: 3h 1m 13s.
[2025-11-13 08:52:41,764][__main__][INFO] - Starting iteration 140.
[2025-11-13 08:52:41,767][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1.
[2025-11-13 08:52:41,767][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:52:48,968][__main__][INFO] - Number of regex retries in iteration 140: 0
[2025-11-13 08:52:48,969][__main__][INFO] - agents played in iteration 140 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:52:49,426][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:52:49,459][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:52:49,492][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:52:49,526][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:52:49,527][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:52:49,527][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:52:50,247][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:52:50,540][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:52:50,864][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:52:51,188][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:52:51,513][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:52:51,843][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:52:52,169][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:52:52,498][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:52:52,825][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:52:53,148][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:52:53,472][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:52:53,799][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:52:54,126][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:52:54,454][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:52:54,783][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:52:55,108][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:52:55,432][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:52:55,760][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:52:56,088][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:52:56,419][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:52:56,744][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:52:57,073][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:52:57,395][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:52:57,718][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:52:58,043][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:52:58,369][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:52:58,694][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:52:59,020][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:52:59,343][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:52:59,666][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:52:59,989][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:53:00,314][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:53:00,637][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:53:01,350][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:53:02,098][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:53:02,099][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:53:02,101][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:53:04,046][__main__][INFO] - Iteration 141 took 22s (32.33% Gen, 58.94% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 44m 44s. Estimated total time: 18h 33m 59s. Time estimates for 10 more iterations: 3m 42s, 100 more iterations: 37m 7s, 500 more iterations: 3h 5m 39s.
[2025-11-13 08:53:04,048][__main__][INFO] - Starting iteration 141.
[2025-11-13 08:53:04,051][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1.
[2025-11-13 08:53:04,051][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:53:11,990][__main__][INFO] - Number of regex retries in iteration 141: 0
[2025-11-13 08:53:11,991][__main__][INFO] - agents played in iteration 141 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:53:12,463][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:53:12,500][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:53:12,534][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:53:12,566][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:53:12,567][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:53:12,568][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:53:13,288][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:53:13,583][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:53:13,908][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:53:14,229][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:53:14,554][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:53:14,876][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:53:15,199][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:53:15,526][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:53:15,848][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:53:16,171][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:53:16,495][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:53:16,819][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:53:17,144][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:53:17,471][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:53:17,795][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:53:18,119][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:53:18,443][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:53:18,765][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:53:19,088][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:53:19,413][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:53:19,736][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:53:20,060][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:53:20,384][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:53:20,706][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:53:21,030][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:53:21,353][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:53:21,676][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:53:21,999][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:53:22,322][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:53:22,647][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:53:22,970][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:53:23,294][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:53:23,618][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:53:24,331][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:53:25,073][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:53:25,075][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:53:25,077][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:53:26,180][__main__][INFO] - Iteration 142 took 22s (35.88% Gen, 59.13% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 36m 52s. Estimated total time: 18h 26m 29s. Time estimates for 10 more iterations: 3m 41s, 100 more iterations: 36m 52s, 500 more iterations: 3h 4m 24s.
[2025-11-13 08:53:26,182][__main__][INFO] - Starting iteration 142.
[2025-11-13 08:53:26,185][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1.
[2025-11-13 08:53:26,185][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:53:33,800][__main__][INFO] - Number of regex retries in iteration 142: 0
[2025-11-13 08:53:33,800][__main__][INFO] - agents played in iteration 142 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:53:34,272][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:53:34,309][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:53:34,345][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:53:34,379][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:53:34,379][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:53:34,380][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:53:35,102][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:53:35,398][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:53:35,725][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:53:36,049][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:53:36,373][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:53:36,698][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:53:37,024][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:53:37,349][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:53:37,674][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:53:37,997][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:53:38,322][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:53:38,646][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:53:38,971][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:53:39,297][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:53:39,621][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:53:39,945][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:53:40,269][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:53:40,595][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:53:40,924][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:53:41,247][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:53:41,572][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:53:41,899][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:53:42,229][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:53:42,556][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:53:42,881][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:53:43,205][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:53:43,530][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:53:43,853][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:53:44,177][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:53:44,502][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:53:44,827][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:53:45,150][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:53:45,475][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:53:46,176][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:53:46,905][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:53:46,907][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:53:46,908][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:53:47,881][__main__][INFO] - Iteration 143 took 21s (35.10% Gen, 60.41% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 14m 52s. Estimated total time: 18h 4m 50s. Time estimates for 10 more iterations: 3m 36s, 100 more iterations: 36m 9s, 500 more iterations: 3h 0m 48s.
[2025-11-13 08:53:47,883][__main__][INFO] - Starting iteration 143.
[2025-11-13 08:53:47,886][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1.
[2025-11-13 08:53:47,887][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:53:55,515][__main__][INFO] - Number of regex retries in iteration 143: 0
[2025-11-13 08:53:55,515][__main__][INFO] - agents played in iteration 143 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:53:55,995][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:53:56,028][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:53:56,061][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:53:56,095][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:53:56,096][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:53:56,096][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:53:56,823][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:53:57,117][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:53:57,442][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:53:57,766][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:53:58,095][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:53:58,423][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:53:58,750][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:53:59,079][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:53:59,405][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:53:59,729][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:54:00,051][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:54:00,376][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:54:00,705][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:54:01,030][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:54:01,353][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:54:01,678][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:54:02,004][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:54:02,330][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:54:02,657][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:54:02,983][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:54:03,307][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:54:03,632][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:54:03,955][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:54:04,281][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:54:04,607][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:54:04,931][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:54:05,256][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:54:05,585][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:54:05,909][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:54:06,233][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:54:06,557][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:54:06,882][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:54:07,213][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:54:07,921][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:54:08,665][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:54:08,666][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:54:08,668][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:54:09,649][__main__][INFO] - Iteration 144 took 21s (35.05% Gen, 60.43% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 17m 51s. Estimated total time: 18h 8m 12s. Time estimates for 10 more iterations: 3m 37s, 100 more iterations: 36m 16s, 500 more iterations: 3h 1m 22s.
[2025-11-13 08:54:09,652][__main__][INFO] - Starting iteration 144.
[2025-11-13 08:54:09,655][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1.
[2025-11-13 08:54:09,655][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:54:17,459][__main__][INFO] - Number of regex retries in iteration 144: 0
[2025-11-13 08:54:17,460][__main__][INFO] - agents played in iteration 144 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:54:17,932][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:54:17,965][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:54:17,998][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:54:18,032][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:54:18,033][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:54:18,033][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:54:18,757][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:54:19,053][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:54:19,378][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:54:19,701][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:54:20,024][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:54:20,347][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:54:20,671][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:54:20,994][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:54:21,322][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:54:21,647][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:54:21,969][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:54:22,292][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:54:22,615][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:54:22,937][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:54:23,262][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:54:23,587][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:54:23,910][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:54:24,235][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:54:24,562][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:54:24,885][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:54:25,210][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:54:25,536][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:54:25,864][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:54:26,189][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:54:26,516][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:54:26,838][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:54:27,162][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:54:27,486][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:54:27,813][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:54:28,137][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:54:28,462][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:54:28,786][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:54:29,109][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:54:29,825][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:54:30,572][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:54:30,573][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:54:30,575][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:54:31,553][__main__][INFO] - Iteration 145 took 21s (35.64% Gen, 59.89% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 24m 15s. Estimated total time: 18h 14m 57s. Time estimates for 10 more iterations: 3m 38s, 100 more iterations: 36m 29s, 500 more iterations: 3h 2m 29s.
[2025-11-13 08:54:31,555][__main__][INFO] - Starting iteration 145.
[2025-11-13 08:54:31,558][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1.
[2025-11-13 08:54:31,559][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:54:39,544][__main__][INFO] - Number of regex retries in iteration 145: 0
[2025-11-13 08:54:39,545][__main__][INFO] - agents played in iteration 145 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:54:40,014][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:54:40,048][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:54:40,082][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:54:40,115][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:54:40,116][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:54:40,116][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:54:40,841][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:54:41,136][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:54:41,461][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:54:41,785][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:54:42,109][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:54:42,431][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:54:42,755][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:54:43,076][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:54:43,402][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:54:43,730][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:54:44,054][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:54:44,379][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:54:44,702][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:54:45,025][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:54:45,349][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:54:45,672][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:54:45,997][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:54:46,320][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:54:46,644][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:54:46,967][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:54:47,290][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:54:47,613][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:54:47,937][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:54:48,262][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:54:48,585][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:54:48,909][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:54:49,233][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:54:49,555][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:54:49,879][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:54:50,203][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:54:50,528][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:54:50,857][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:54:51,183][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:54:51,892][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:54:52,628][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:54:52,630][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:54:52,632][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:54:53,616][__main__][INFO] - Iteration 146 took 22s (36.20% Gen, 59.33% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 31m 51s. Estimated total time: 18h 22m 55s. Time estimates for 10 more iterations: 3m 40s, 100 more iterations: 36m 45s, 500 more iterations: 3h 3m 49s.
[2025-11-13 08:54:53,618][__main__][INFO] - Starting iteration 146.
[2025-11-13 08:54:53,622][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1.
[2025-11-13 08:54:53,623][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:55:01,601][__main__][INFO] - Number of regex retries in iteration 146: 0
[2025-11-13 08:55:01,601][__main__][INFO] - agents played in iteration 146 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:55:02,062][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:55:02,098][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:55:02,131][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:55:02,165][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:55:02,166][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:55:02,166][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:55:02,877][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:55:03,172][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:55:03,496][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:55:03,819][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:55:04,144][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:55:04,466][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:55:04,790][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:55:05,112][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:55:05,436][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:55:05,760][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:55:06,084][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:55:06,407][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:55:06,732][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:55:07,055][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:55:07,381][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:55:07,708][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:55:08,031][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:55:08,355][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:55:08,679][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:55:09,003][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:55:09,327][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:55:09,649][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:55:09,973][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:55:10,299][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:55:10,627][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:55:10,949][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:55:11,275][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:55:11,601][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:55:11,923][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:55:12,247][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:55:12,570][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:55:12,895][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:55:13,219][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:55:13,928][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:55:14,666][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:55:14,667][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:55:14,669][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:55:15,678][__main__][INFO] - Iteration 147 took 22s (36.17% Gen, 59.25% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 31m 23s. Estimated total time: 18h 22m 50s. Time estimates for 10 more iterations: 3m 40s, 100 more iterations: 36m 45s, 500 more iterations: 3h 3m 48s.
[2025-11-13 08:55:15,680][__main__][INFO] - Starting iteration 147.
[2025-11-13 08:55:15,682][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1.
[2025-11-13 08:55:15,683][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:55:23,525][__main__][INFO] - Number of regex retries in iteration 147: 0
[2025-11-13 08:55:23,526][__main__][INFO] - agents played in iteration 147 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:55:23,991][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:55:24,024][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:55:24,057][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:55:24,091][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:55:24,091][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:55:24,092][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:55:24,814][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:55:25,107][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:55:25,432][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:55:25,754][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:55:26,079][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:55:26,405][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:55:26,730][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:55:27,057][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:55:27,381][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:55:27,708][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:55:28,033][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:55:28,356][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:55:28,678][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:55:29,003][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:55:29,326][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:55:29,653][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:55:29,976][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:55:30,299][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:55:30,623][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:55:30,952][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:55:31,280][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:55:31,604][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:55:31,929][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:55:32,259][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:55:32,583][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:55:32,912][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:55:33,236][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:55:33,561][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:55:33,888][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:55:34,216][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:55:34,538][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:55:34,863][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:55:35,188][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:55:35,941][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:55:36,683][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:55:36,684][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:55:36,686][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:55:37,886][__main__][INFO] - Iteration 148 took 22s (35.32% Gen, 59.27% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 38m 23s. Estimated total time: 18h 30m 12s. Time estimates for 10 more iterations: 3m 42s, 100 more iterations: 37m 0s, 500 more iterations: 3h 5m 2s.
[2025-11-13 08:55:37,887][__main__][INFO] - Starting iteration 148.
[2025-11-13 08:55:37,890][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1.
[2025-11-13 08:55:37,890][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:55:45,814][__main__][INFO] - Number of regex retries in iteration 148: 0 [2025-11-13 08:55:45,814][__main__][INFO] - agents played in iteration 148 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 08:55:46,278][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:55:46,313][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:55:46,347][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:55:46,380][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:55:46,381][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:55:46,382][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
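The per-task VRAM records ("ΔVRAM % (total) … Block Peak % of device VRAM … ΔTime") report allocation deltas and peaks as percentages of device memory. A hypothetical formatter for such a record, taking byte counts as plain arguments (on a real device these would likely come from `torch.cuda.memory_allocated()` / `torch.cuda.max_memory_allocated()`; the function below is an invented illustration, not the trainer's code):

```python
def vram_report(task, before, after, peak, total, seconds):
    """Format one VRAM-accounting record.

    before/after: allocated bytes at block entry/exit; peak: peak
    allocated bytes within the block; total: device capacity in bytes;
    seconds: elapsed wall-clock time for the block.
    """
    delta_pct = 100 * (after - before) / total
    cur_pct = 100 * after / total
    peak_pct = 100 * peak / total
    h, rem = divmod(int(seconds), 3600)
    m, s = divmod(rem, 60)
    return (f"For task: {task}, \u0394VRAM % (total): {delta_pct:.2f}%, "
            f"Current % of VRAM taken: {cur_pct:.2f}%, "
            f"Block Peak % of device VRAM: {peak_pct:.2f}%, "
            f"\u0394Time: {h:02d}:{m:02d}:{s:02d}")
```

With before=39.53%, after=42.04%, and peak=25.98% of capacity, this reproduces the "Apply reinforce step" record's 2.51% delta; the repeated 0.00%-delta advantage records are consistent with a no-grad forward pass that frees what it allocates.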
[2025-11-13 08:55:47,076][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:55:47,369][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:55:47,692][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:55:48,016][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:55:48,341][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:55:48,666][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:55:48,991][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:55:49,315][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:55:49,638][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:55:49,963][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:55:50,291][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:55:50,615][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:55:50,942][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:55:51,264][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:55:51,593][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:55:51,920][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:55:52,245][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:55:52,567][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:55:52,891][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:55:53,217][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:55:53,542][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:55:53,869][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:55:54,196][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:55:54,523][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:55:54,852][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:55:55,180][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:55:55,509][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:55:55,835][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:55:56,161][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:55:56,486][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:55:56,815][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:55:57,138][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:55:57,465][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:55:58,177][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:55:58,903][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:55:58,904][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:55:58,907][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:55:59,805][__main__][INFO] - Iteration 149 took 21s (36.16% Gen, 59.74% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 23m 37s. Estimated total time: 18h 15m 48s. Time estimates for 10 more iterations: 3m 39s, 100 more iterations: 36m 31s, 500 more iterations: 3h 2m 38s. [2025-11-13 08:55:59,807][__main__][INFO] - Starting iteration 149. [2025-11-13 08:55:59,810][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2025-11-13 08:55:59,810][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:56:07,609][__main__][INFO] - Number of regex retries in iteration 149: 0 [2025-11-13 08:56:07,610][__main__][INFO] - agents played in iteration 149 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 08:56:08,084][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:56:08,118][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:56:08,151][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:56:08,185][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:56:08,186][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:56:08,186][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:56:08,881][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:56:09,173][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:56:09,497][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:56:09,820][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:56:10,144][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:56:10,466][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:56:10,789][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:56:11,114][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:56:11,436][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:56:11,760][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:56:12,085][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:56:12,408][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:56:12,731][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:56:13,055][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:56:13,377][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:56:13,702][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:56:14,027][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:56:14,352][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:56:14,676][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:56:15,000][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:56:15,323][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:56:15,646][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:56:15,969][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:56:16,293][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:56:16,618][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:56:16,943][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:56:17,266][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:56:17,588][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:56:17,911][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:56:18,236][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:56:18,560][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:56:18,885][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:56:19,209][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:56:19,918][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:56:20,645][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:56:20,647][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:56:20,648][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:56:21,576][__main__][INFO] - Iteration 150 took 21s (35.83% Gen, 59.90% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 15m 48s. Estimated total time: 18h 8m 20s. Time estimates for 10 more iterations: 3m 37s, 100 more iterations: 36m 16s, 500 more iterations: 3h 1m 23s. [2025-11-13 08:56:21,578][__main__][INFO] - Starting iteration 150. [2025-11-13 08:56:21,580][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2025-11-13 08:56:21,581][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:56:29,238][__main__][INFO] - Number of regex retries in iteration 150: 0 [2025-11-13 08:56:29,239][__main__][INFO] - agents played in iteration 150 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 08:56:29,698][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:56:29,731][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:56:29,763][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:56:29,796][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:56:29,797][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:56:29,797][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:56:30,519][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:56:30,813][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:56:31,137][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:56:31,464][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:56:31,790][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:56:32,117][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:56:32,441][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:56:32,767][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:56:33,089][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:56:33,413][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:56:33,737][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:56:34,065][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:56:34,388][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:56:34,710][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:56:35,033][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:56:35,356][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:56:35,680][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:56:36,003][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:56:36,325][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:56:36,650][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:56:36,972][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:56:37,295][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:56:37,619][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:56:37,943][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:56:38,265][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:56:38,589][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:56:38,913][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:56:39,235][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:56:39,560][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:56:39,886][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:56:40,211][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:56:40,535][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:56:40,859][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:56:41,574][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:56:42,311][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:56:42,312][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:56:42,314][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:56:44,166][__main__][INFO] - Iteration 151 took 22s (33.91% Gen, 57.89% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 56m 23s. Estimated total time: 18h 49m 18s. Time estimates for 10 more iterations: 3m 45s, 100 more iterations: 37m 38s, 500 more iterations: 3h 8m 13s. [2025-11-13 08:56:44,168][__main__][INFO] - Starting iteration 151. [2025-11-13 08:56:44,171][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. [2025-11-13 08:56:44,172][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:56:50,702][mllm.models.large_language_model_local][WARNING] - Response user Last round, the other agent played . 
did not match regex: (|), retry 1/1 [2025-11-13 08:56:53,035][__main__][INFO] - Number of regex retries in iteration 151: 1 [2025-11-13 08:56:53,035][__main__][INFO] - agents played in iteration 151 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 08:56:53,496][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:56:53,533][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:56:53,567][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:56:53,601][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:56:53,601][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:56:53,602][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
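The warning in iteration 151 shows the generation loop re-sampling when a model response fails a regex check ("retry 1/1"), with the per-iteration retry count logged afterwards. A hypothetical sketch of that retry loop (the tag format and all names are invented for illustration; the log's actual pattern appears as the empty alternation "(|)"):

```python
import re

def match_with_retries(generate, pattern, max_retries=1):
    """Call generate() until its output matches pattern, allowing up to
    max_retries extra attempts. Returns (match_or_None, retries_used)."""
    retries = 0
    while True:
        response = generate()
        m = re.search(pattern, response)
        if m is not None:
            return m, retries
        if retries >= max_retries:
            # a real trainer would log the failing response and give up here
            return None, retries
        retries += 1  # counted into the per-iteration "regex retries" total
```

The empty alternation in the logged pattern matches the empty string under `re.search`, which suggests the warning path was reached with a degenerate pattern or response; the log alone does not say which.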
[2025-11-13 08:56:54,278][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:56:54,572][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:56:54,896][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:56:55,222][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:56:55,546][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:56:55,871][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:56:56,195][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:56:56,518][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:56:56,840][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:56:57,163][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:56:57,487][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:56:57,815][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:56:58,144][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:56:58,466][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:56:58,790][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:56:59,114][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:56:59,438][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:56:59,763][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:57:00,086][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:57:00,413][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:57:00,737][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:57:01,061][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:57:01,386][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:57:01,710][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:57:02,032][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:57:02,361][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:57:02,690][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:57:03,014][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:57:03,341][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:57:03,666][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:57:03,989][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:57:04,315][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:57:04,640][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:57:05,363][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:57:06,084][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:57:06,085][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:57:06,087][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:57:06,978][__main__][INFO] - Iteration 152 took 22s (38.86% Gen, 57.22% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 7m 4s. Estimated total time: 19h 0m 22s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 0s, 500 more iterations: 3h 10m 3s. [2025-11-13 08:57:06,981][__main__][INFO] - Starting iteration 152. [2025-11-13 08:57:06,984][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2025-11-13 08:57:06,985][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:57:14,926][__main__][INFO] - Number of regex retries in iteration 152: 0 [2025-11-13 08:57:14,927][__main__][INFO] - agents played in iteration 152 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 08:57:15,397][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:57:15,437][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:57:15,471][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:57:15,506][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:57:15,506][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:57:15,507][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:57:16,184][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:57:16,478][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:57:16,801][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:57:17,125][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:57:17,449][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:57:17,772][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:57:18,095][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:57:18,417][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:57:18,741][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:57:19,065][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:57:19,389][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:57:19,712][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:57:20,036][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:57:20,361][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:57:20,683][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:57:21,007][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:57:21,332][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:57:21,655][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:57:21,977][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:57:22,299][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:57:22,622][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:57:22,946][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:57:23,270][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:57:23,594][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:57:23,917][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:57:24,243][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:57:24,567][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:57:24,889][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:57:25,213][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:57:25,536][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:57:25,860][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:57:26,183][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:57:26,507][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:57:27,217][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 08:57:27,928][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:57:27,930][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:57:27,931][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:57:28,862][__main__][INFO] - Iteration 153 took 21s (36.30% Gen, 59.44% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 20m 15s. Estimated total time: 18h 13m 55s. Time estimates for 10 more iterations: 3m 38s, 100 more iterations: 36m 27s, 500 more iterations: 3h 2m 19s. [2025-11-13 08:57:28,864][__main__][INFO] - Starting iteration 153. [2025-11-13 08:57:28,866][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2025-11-13 08:57:28,866][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:57:36,914][__main__][INFO] - Number of regex retries in iteration 153: 0 [2025-11-13 08:57:36,914][__main__][INFO] - agents played in iteration 153 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 08:57:37,371][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:57:37,407][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:57:37,441][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:57:37,475][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:57:37,475][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:57:37,476][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:57:38,174][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 08:57:38,466][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 08:57:38,794][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 08:57:39,118][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 08:57:39,446][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 08:57:39,771][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 08:57:40,098][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 08:57:40,422][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 08:57:40,746][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 08:57:41,068][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 08:57:41,390][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 08:57:41,714][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 08:57:42,039][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 08:57:42,363][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 08:57:42,685][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 08:57:43,009][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 08:57:43,332][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 08:57:43,654][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 08:57:43,976][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 08:57:44,300][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:57:44,624][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 08:57:44,947][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 08:57:45,270][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 08:57:45,592][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 08:57:45,916][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 08:57:46,238][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 08:57:46,561][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 08:57:46,884][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 08:57:47,210][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 08:57:47,533][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 08:57:47,858][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 08:57:48,182][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 08:57:48,508][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:57:49,229][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:57:49,917][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:57:49,918][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:57:49,920][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:57:50,923][__main__][INFO] - Iteration 154 took 22s (36.48% Gen, 58.97% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 28m 50s. Estimated total time: 18h 22m 52s. Time estimates for 10 more iterations: 3m 40s, 100 more iterations: 36m 45s, 500 more iterations: 3h 3m 48s.
[2025-11-13 08:57:50,925][__main__][INFO] - Starting iteration 154.
[2025-11-13 08:57:50,927][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1.
[2025-11-13 08:57:50,928][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:57:58,899][__main__][INFO] - Number of regex retries in iteration 154: 0
[2025-11-13 08:57:58,899][__main__][INFO] - agents played in iteration 154 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:57:59,357][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:57:59,393][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:57:59,427][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:57:59,460][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:57:59,461][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:57:59,461][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:58:00,157][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:58:00,451][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:58:00,775][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:58:01,098][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:58:01,421][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:58:01,745][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:58:02,070][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:58:02,393][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:58:02,717][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:58:03,042][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:58:03,364][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:58:03,689][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:58:04,013][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:58:04,343][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:58:04,668][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:58:04,989][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:58:05,315][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:58:05,640][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:58:05,966][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:58:06,290][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:58:06,614][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:58:06,937][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:58:07,260][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:58:07,584][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:58:07,907][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:58:08,231][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:58:08,554][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:58:08,877][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:58:09,201][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:58:09,526][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:58:09,855][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:58:10,179][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:58:10,508][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:58:11,225][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:58:11,927][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:58:11,929][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:58:11,930][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:58:12,812][__main__][INFO] - Iteration 155 took 21s (36.42% Gen, 59.54% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 19m 52s. Estimated total time: 18h 14m 16s. Time estimates for 10 more iterations: 3m 38s, 100 more iterations: 36m 28s, 500 more iterations: 3h 2m 22s.
[2025-11-13 08:58:12,814][__main__][INFO] - Starting iteration 155.
[2025-11-13 08:58:12,817][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1.
[2025-11-13 08:58:12,817][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:58:20,807][__main__][INFO] - Number of regex retries in iteration 155: 0
[2025-11-13 08:58:20,807][__main__][INFO] - agents played in iteration 155 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:58:21,271][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:58:21,304][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:58:21,337][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:58:21,371][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:58:21,371][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:58:21,372][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:58:22,073][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:58:22,365][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:58:22,690][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:58:23,013][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:58:23,336][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:58:23,661][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:58:23,984][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:58:24,308][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:58:24,632][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:58:24,954][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:58:25,276][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:58:25,598][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:58:25,923][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:58:26,247][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:58:26,569][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:58:26,892][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:58:27,215][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:58:27,540][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:58:27,863][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:58:28,185][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:58:28,508][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:58:28,831][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:58:29,153][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:58:29,476][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:58:29,801][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:58:30,125][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:58:30,448][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:58:30,772][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:58:31,097][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:58:31,422][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:58:31,746][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:58:32,069][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:58:32,394][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:58:33,113][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:58:33,806][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:58:33,807][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:58:33,809][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:58:34,718][__main__][INFO] - Iteration 156 took 21s (36.48% Gen, 59.36% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 20m 21s. Estimated total time: 18h 15m 7s. Time estimates for 10 more iterations: 3m 39s, 100 more iterations: 36m 30s, 500 more iterations: 3h 2m 31s.
[2025-11-13 08:58:34,721][__main__][INFO] - Starting iteration 156.
[2025-11-13 08:58:34,723][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1.
[2025-11-13 08:58:34,724][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:58:42,631][__main__][INFO] - Number of regex retries in iteration 156: 0
[2025-11-13 08:58:42,632][__main__][INFO] - agents played in iteration 156 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:58:43,091][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:58:43,127][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:58:43,160][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:58:43,193][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:58:43,194][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:58:43,194][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:58:43,902][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:58:44,196][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:58:44,519][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:58:44,843][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:58:45,167][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:58:45,496][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:58:45,823][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:58:46,148][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:58:46,475][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:58:46,798][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:58:47,121][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:58:47,444][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:58:47,766][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:58:48,089][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:58:48,413][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:58:48,737][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:58:49,060][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:58:49,383][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:58:49,708][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:58:50,031][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:58:50,356][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:58:50,679][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:58:51,005][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:58:51,328][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:58:51,650][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:58:51,974][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:58:52,297][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:58:52,621][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:58:52,945][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:58:53,271][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:58:53,594][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:58:53,917][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:58:54,244][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:58:54,964][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:58:55,660][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:58:55,661][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:58:55,663][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:58:56,650][__main__][INFO] - Iteration 157 took 21s (36.06% Gen, 59.43% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 21m 16s. Estimated total time: 18h 16m 23s. Time estimates for 10 more iterations: 3m 39s, 100 more iterations: 36m 32s, 500 more iterations: 3h 2m 43s.
[2025-11-13 08:58:56,652][__main__][INFO] - Starting iteration 157.
[2025-11-13 08:58:56,655][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1.
[2025-11-13 08:58:56,655][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:59:04,460][__main__][INFO] - Number of regex retries in iteration 157: 0
[2025-11-13 08:59:04,460][__main__][INFO] - agents played in iteration 157 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:59:04,920][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:59:04,954][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:59:04,987][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:59:05,021][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:59:05,022][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:59:05,022][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:59:05,716][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:59:06,010][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:59:06,335][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:59:06,662][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:59:06,984][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:59:07,308][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:59:07,632][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:59:07,955][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:59:08,277][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:59:08,607][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:59:08,931][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:59:09,256][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:59:09,579][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:59:09,905][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:59:10,229][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:59:10,554][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:59:10,877][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:59:11,202][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:59:11,526][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:59:11,850][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:59:12,173][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:59:12,497][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:59:12,822][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:59:13,147][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:59:13,471][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:59:13,795][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:59:14,122][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:59:14,450][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:59:14,781][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:59:15,109][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:59:15,432][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:59:15,755][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:59:16,081][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:59:16,799][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:59:17,491][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:59:17,493][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:59:17,495][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:59:18,453][__main__][INFO] - Iteration 158 took 21s (35.80% Gen, 59.79% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 14m 27s. Estimated total time: 18h 9m 56s. Time estimates for 10 more iterations: 3m 37s, 100 more iterations: 36m 19s, 500 more iterations: 3h 1m 39s.
[2025-11-13 08:59:18,455][__main__][INFO] - Starting iteration 158.
[2025-11-13 08:59:18,457][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1.
[2025-11-13 08:59:18,458][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:59:26,088][__main__][INFO] - Number of regex retries in iteration 158: 0
[2025-11-13 08:59:26,089][__main__][INFO] - agents played in iteration 158 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:59:26,548][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:59:26,584][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:59:26,616][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:59:26,650][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:59:26,650][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:59:26,651][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:59:27,371][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:59:27,664][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:59:27,992][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:59:28,313][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:59:28,638][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:59:28,963][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:59:29,288][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:59:29,615][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:59:29,941][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:59:30,267][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:59:30,596][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:59:30,923][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:59:31,247][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:59:31,571][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:59:31,895][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:59:32,221][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:59:32,549][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:59:32,873][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:59:33,198][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:59:33,525][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:59:33,854][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:59:34,183][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:59:34,508][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:59:34,834][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:59:35,158][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:59:35,485][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:59:35,811][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:59:36,134][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:59:36,460][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:59:36,781][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:59:37,107][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:59:37,431][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:59:37,755][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 08:59:38,480][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 08:59:39,210][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 08:59:39,211][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 08:59:39,213][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 08:59:40,266][__main__][INFO] - Iteration 159 took 21s (34.99% Gen, 60.18% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 14m 37s. Estimated total time: 18h 10m 29s. Time estimates for 10 more iterations: 3m 38s, 100 more iterations: 36m 20s, 500 more iterations: 3h 1m 44s.
[2025-11-13 08:59:40,268][__main__][INFO] - Starting iteration 159.
[2025-11-13 08:59:40,271][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1.
[2025-11-13 08:59:40,271][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:59:47,746][__main__][INFO] - Number of regex retries in iteration 159: 0
[2025-11-13 08:59:47,746][__main__][INFO] - agents played in iteration 159 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 08:59:48,203][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:59:48,237][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:59:48,271][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:59:48,305][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:59:48,305][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:59:48,306][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:59:49,032][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 08:59:49,327][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 08:59:49,651][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 08:59:49,974][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 08:59:50,297][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 08:59:50,620][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 08:59:50,944][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 08:59:51,270][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 08:59:51,594][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 08:59:51,918][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 08:59:52,242][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 08:59:52,564][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 08:59:52,887][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 08:59:53,211][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 08:59:53,534][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 08:59:53,857][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 08:59:54,180][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 08:59:54,504][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 08:59:54,829][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 08:59:55,154][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 08:59:55,478][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 08:59:55,804][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 08:59:56,131][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 08:59:56,455][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 08:59:56,786][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 08:59:57,113][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 08:59:57,438][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 08:59:57,763][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 08:59:58,087][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 08:59:58,413][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 08:59:58,739][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 08:59:59,064][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 08:59:59,386][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:00:00,093][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:00:00,816][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:00:00,818][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:00:00,819][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:00:01,777][__main__][INFO] - Iteration 160 took 21s (34.75% Gen, 60.78% Train). Generation: 7s, Training: 13s. Estimated remaining time: 16h 59m 8s. Estimated total time: 17h 55m 21s. Time estimates for 10 more iterations: 3m 35s, 100 more iterations: 35m 50s, 500 more iterations: 2h 59m 13s.
[2025-11-13 09:00:01,779][__main__][INFO] - Starting iteration 160.
[2025-11-13 09:00:01,782][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1.
[2025-11-13 09:00:01,783][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:00:09,250][__main__][INFO] - Number of regex retries in iteration 160: 0
[2025-11-13 09:00:09,250][__main__][INFO] - agents played in iteration 160 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:00:09,717][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:00:09,754][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:00:09,787][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:00:09,821][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:00:09,822][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:00:09,822][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:00:10,542][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:00:10,835][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:00:11,158][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:00:11,483][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:00:11,809][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:00:12,132][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:00:12,457][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:00:12,785][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:00:13,109][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:00:13,437][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:00:13,761][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:00:14,083][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:00:14,407][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:00:14,731][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:00:15,056][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:00:15,383][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:00:15,709][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:00:16,034][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:00:16,365][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:00:16,689][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:00:17,018][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:00:17,342][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:00:17,666][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:00:17,991][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:00:18,317][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:00:18,640][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:00:18,965][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:00:19,290][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:00:19,613][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:00:19,938][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:00:20,263][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:00:20,588][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:00:20,914][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:00:21,632][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:00:22,339][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:00:22,340][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:00:22,342][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:00:24,490][__main__][INFO] - Iteration 161 took 22s (32.88% Gen, 57.65% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 58m 51s. Estimated total time: 18h 55m 27s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 50s, 500 more iterations: 3h 9m 14s.
[2025-11-13 09:00:24,493][__main__][INFO] - Starting iteration 161.
[2025-11-13 09:00:24,496][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1.
[2025-11-13 09:00:24,496][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:00:32,156][__main__][INFO] - Number of regex retries in iteration 161: 0
[2025-11-13 09:00:32,157][__main__][INFO] - agents played in iteration 161 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:00:32,609][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:00:32,646][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:00:32,681][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:00:33,046][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:00:33,047][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:00:33,047][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:00:33,758][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:00:34,053][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:00:34,378][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:00:34,701][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:00:35,025][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:00:35,347][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:00:35,670][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:00:35,992][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:00:36,315][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:00:36,643][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:00:36,968][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:00:37,291][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:00:37,617][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:00:37,941][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:00:38,270][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:00:38,594][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:00:38,918][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:00:39,242][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:00:39,569][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:00:39,894][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:00:40,221][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:00:40,545][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:00:40,868][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:00:41,192][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:00:41,516][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:00:41,841][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:00:42,164][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:00:42,489][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:00:42,814][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:00:43,138][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:00:43,462][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:00:43,787][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:00:44,111][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:00:44,830][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:00:45,560][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:00:45,562][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:00:45,563][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:00:46,577][__main__][INFO] - Iteration 162 took 22s (34.69% Gen, 60.71% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 27m 10s. Estimated total time: 18h 24m 7s. Time estimates for 10 more iterations: 3m 40s, 100 more iterations: 36m 48s, 500 more iterations: 3h 4m 1s.
[2025-11-13 09:00:46,580][__main__][INFO] - Starting iteration 162.
[2025-11-13 09:00:46,584][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1.
[2025-11-13 09:00:46,585][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:00:54,362][__main__][INFO] - Number of regex retries in iteration 162: 0
[2025-11-13 09:00:54,363][__main__][INFO] - agents played in iteration 162 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:00:54,830][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:00:54,862][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:00:54,895][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:00:54,929][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:00:54,930][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:00:54,930][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:00:55,661][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:00:55,956][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:00:56,283][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:00:56,612][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:00:56,936][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:00:57,260][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:00:57,583][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:00:57,905][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:00:58,229][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:00:58,555][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:00:58,879][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:00:59,203][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:00:59,528][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:00:59,851][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:01:00,176][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:01:00,500][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:01:00,830][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:01:01,155][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:01:01,477][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:01:01,804][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:01:02,128][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:01:02,453][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:01:02,776][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:01:03,101][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:01:03,425][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:01:03,748][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:01:04,071][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:01:04,394][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:01:04,719][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:01:05,042][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:01:05,366][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:01:05,692][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:01:06,017][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:01:06,759][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:01:07,494][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:01:07,495][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:01:07,497][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:01:08,588][__main__][INFO] - Iteration 163 took 22s (35.35% Gen, 59.69% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 22m 54s. Estimated total time: 18h 20m 13s. Time estimates for 10 more iterations: 3m 40s, 100 more iterations: 36m 40s, 500 more iterations: 3h 3m 22s.
[2025-11-13 09:01:08,590][__main__][INFO] - Starting iteration 163.
[2025-11-13 09:01:08,593][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1.
[2025-11-13 09:01:08,594][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:01:16,612][__main__][INFO] - Number of regex retries in iteration 163: 0
[2025-11-13 09:01:16,612][__main__][INFO] - agents played in iteration 163 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:01:17,077][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:01:17,110][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:01:17,142][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:01:17,174][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:01:17,175][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:01:17,175][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:01:17,892][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:01:18,186][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:01:18,510][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:01:18,833][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:01:19,159][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:01:19,482][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:01:19,806][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:01:20,129][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:01:20,452][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:01:20,777][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:01:21,101][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:01:21,426][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:01:21,750][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:01:22,075][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:01:22,404][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:01:22,728][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:01:23,052][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:01:23,375][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:01:23,700][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:01:24,024][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:01:24,348][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:01:24,671][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:01:24,994][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:01:25,320][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:01:25,645][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:01:25,971][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:01:26,296][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:01:26,625][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:01:26,953][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:01:27,277][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:01:27,601][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:01:27,925][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:01:28,248][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:01:28,996][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:01:29,725][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:01:29,726][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:01:29,728][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:01:30,695][__main__][INFO] - Iteration 164 took 22s (36.28% Gen, 59.34% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 27m 25s. Estimated total time: 18h 25m 6s. Time estimates for 10 more iterations: 3m 41s, 100 more iterations: 36m 50s, 500 more iterations: 3h 4m 11s.
[2025-11-13 09:01:30,697][__main__][INFO] - Starting iteration 164.
[2025-11-13 09:01:30,700][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1.
[2025-11-13 09:01:30,700][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:01:38,645][__main__][INFO] - Number of regex retries in iteration 164: 0
[2025-11-13 09:01:38,646][__main__][INFO] - agents played in iteration 164 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:01:39,104][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:01:39,139][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:01:39,172][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:01:39,206][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:01:39,207][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:01:39,207][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:01:39,928][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:01:40,222][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:01:40,545][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:01:40,872][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:01:41,198][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:01:41,523][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:01:41,848][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:01:42,170][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:01:42,493][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:01:42,818][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:01:43,143][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:01:43,468][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:01:43,791][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:01:44,115][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:01:44,439][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:01:44,765][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:01:45,089][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:01:45,413][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:01:45,737][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:01:46,061][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:01:46,385][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:01:46,709][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:01:47,033][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:01:47,356][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:01:47,682][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:01:48,011][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:01:48,341][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:01:48,663][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:01:48,986][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:01:49,309][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:01:49,633][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:01:49,960][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:01:50,287][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:01:50,994][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:01:51,718][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:01:51,719][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:01:51,721][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:01:52,719][__main__][INFO] - Iteration 165 took 22s (36.08% Gen, 59.38% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 22m 54s. Estimated total time: 18h 20m 58s. Time estimates for 10 more iterations: 3m 40s, 100 more iterations: 36m 41s, 500 more iterations: 3h 3m 29s.
[2025-11-13 09:01:52,722][__main__][INFO] - Starting iteration 165.
[2025-11-13 09:01:52,725][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1.
[2025-11-13 09:01:52,726][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:02:00,540][__main__][INFO] - Number of regex retries in iteration 165: 0
[2025-11-13 09:02:00,541][__main__][INFO] - agents played in iteration 165 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:02:01,006][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:02:01,042][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:02:01,078][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:02:01,113][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:02:01,114][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:02:01,114][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:02:01,826][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:02:02,121][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:02:02,444][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:02:02,767][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:02:03,093][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:02:03,419][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:02:03,741][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:02:04,068][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:02:04,397][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:02:04,722][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:02:05,047][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:02:05,371][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:02:05,695][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:02:06,020][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:02:06,344][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:02:06,668][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:02:06,991][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:02:07,317][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:02:07,643][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:02:07,968][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:02:08,294][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:02:08,619][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:02:08,943][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:02:09,268][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:02:09,592][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:02:09,916][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:02:10,244][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:02:10,566][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:02:10,892][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:02:11,215][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:02:11,541][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:02:11,867][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:02:12,192][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:02:12,920][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:02:13,654][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:02:13,655][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:02:13,657][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:02:14,656][__main__][INFO] - Iteration 166 took 21s (35.63% Gen, 59.80% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 18m 11s. Estimated total time: 18h 16m 36s. Time estimates for 10 more iterations: 3m 39s, 100 more iterations: 36m 33s, 500 more iterations: 3h 2m 46s.
[2025-11-13 09:02:14,659][__main__][INFO] - Starting iteration 166.
[2025-11-13 09:02:14,664][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1.
[2025-11-13 09:02:14,664][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:02:22,528][__main__][INFO] - Number of regex retries in iteration 166: 0
[2025-11-13 09:02:22,529][__main__][INFO] - agents played in iteration 166 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:02:23,003][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:02:23,036][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:02:23,070][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:02:23,104][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:02:23,104][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:02:23,105][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:02:23,815][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:02:24,110][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:02:24,438][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:02:24,765][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:02:25,091][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:02:25,417][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:02:25,741][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:02:26,065][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:02:26,388][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:02:26,711][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:02:27,035][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:02:27,358][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:02:27,682][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:02:28,006][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:02:28,330][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:02:28,655][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:02:28,984][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:02:29,310][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:02:29,636][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:02:29,961][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:02:30,286][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:02:30,611][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:02:30,934][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:02:31,256][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:02:31,580][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:02:31,908][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:02:32,234][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:02:32,561][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:02:32,884][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:02:33,210][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:02:33,533][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:02:33,857][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:02:34,181][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:02:34,886][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:02:35,600][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:02:35,601][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:02:35,603][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:02:36,503][__main__][INFO] - Iteration 167 took 21s (36.00% Gen, 59.86% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 13m 18s. Estimated total time: 18h 12m 5s. Time estimates for 10 more iterations: 3m 38s, 100 more iterations: 36m 24s, 500 more iterations: 3h 2m 0s.
[2025-11-13 09:02:36,505][__main__][INFO] - Starting iteration 167.
[2025-11-13 09:02:36,508][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1.
[2025-11-13 09:02:36,508][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:02:44,427][__main__][INFO] - Number of regex retries in iteration 167: 0
[2025-11-13 09:02:44,428][__main__][INFO] - agents played in iteration 167 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:02:44,893][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:02:44,927][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:02:44,961][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:02:44,994][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:02:44,995][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:02:44,995][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:02:45,712][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:02:46,006][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:02:46,331][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:02:46,659][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:02:46,984][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:02:47,311][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:02:47,638][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:02:47,963][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:02:48,287][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:02:48,611][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:02:48,935][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:02:49,264][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:02:49,589][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:02:49,918][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:02:50,245][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:02:50,569][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:02:50,898][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:02:51,223][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:02:51,549][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:02:51,874][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:02:52,201][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:02:52,528][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:02:52,850][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:02:53,175][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:02:53,500][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:02:53,829][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:02:54,153][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:02:54,481][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:02:54,808][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:02:55,135][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:02:55,463][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:02:55,789][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:02:56,115][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:02:56,824][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:02:57,545][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:02:57,546][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:02:57,548][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:02:58,497][__main__][INFO] - Iteration 168 took 21s (36.01% Gen, 59.67% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 20m 21s. Estimated total time: 18h 19m 30s. Time estimates for 10 more iterations: 3m 39s, 100 more iterations: 36m 39s, 500 more iterations: 3h 3m 15s.
[2025-11-13 09:02:58,499][__main__][INFO] - Starting iteration 168.
[2025-11-13 09:02:58,502][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1.
[2025-11-13 09:02:58,502][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:03:06,339][__main__][INFO] - Number of regex retries in iteration 168: 0
[2025-11-13 09:03:06,340][__main__][INFO] - agents played in iteration 168 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:03:06,804][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:03:06,843][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:03:06,878][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:03:06,911][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:03:06,912][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:03:06,912][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:03:07,656][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:03:07,951][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:03:08,276][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:03:08,600][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:03:08,925][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:03:09,251][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:03:09,574][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:03:09,899][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:03:10,224][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:03:10,549][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:03:10,871][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:03:11,196][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:03:11,522][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:03:11,846][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:03:12,172][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:03:12,495][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:03:12,820][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:03:13,143][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:03:13,467][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:03:13,790][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:03:14,116][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:03:14,440][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:03:14,765][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:03:15,090][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:03:15,413][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:03:15,736][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:03:16,059][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:03:16,381][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:03:16,704][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:03:17,028][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:03:17,350][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:03:17,672][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:03:17,995][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:03:18,704][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:03:19,421][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:03:19,423][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:03:19,424][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:03:20,454][__main__][INFO] - Iteration 169 took 21s (35.70% Gen, 59.60% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 18m 7s. Estimated total time: 18h 17m 38s. Time estimates for 10 more iterations: 3m 39s, 100 more iterations: 36m 35s, 500 more iterations: 3h 2m 56s.
[2025-11-13 09:03:20,456][__main__][INFO] - Starting iteration 169.
[2025-11-13 09:03:20,460][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1.
[2025-11-13 09:03:20,460][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:03:28,223][__main__][INFO] - Number of regex retries in iteration 169: 0
[2025-11-13 09:03:28,224][__main__][INFO] - agents played in iteration 169 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:03:28,691][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:03:28,724][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:03:28,757][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:03:28,792][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:03:28,792][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:03:28,793][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:03:29,527][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:03:29,822][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:03:30,145][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:03:30,469][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:03:30,792][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:03:31,114][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:03:31,438][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:03:31,762][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:03:32,087][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:03:32,409][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:03:32,732][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:03:33,057][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:03:33,382][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:03:33,708][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:03:34,031][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:03:34,354][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:03:34,678][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:03:35,003][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:03:35,327][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:03:35,651][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:03:35,977][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:03:36,304][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:03:36,629][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:03:36,951][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:03:37,274][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:03:37,599][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:03:37,921][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:03:38,247][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:03:38,570][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:03:38,895][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:03:39,218][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:03:39,541][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:03:39,865][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:03:40,575][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:03:41,294][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:03:41,296][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:03:41,297][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:03:42,356][__main__][INFO] - Iteration 170 took 21s (35.46% Gen, 59.71% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 14m 56s. Estimated total time: 18h 14m 49s. Time estimates for 10 more iterations: 3m 38s, 100 more iterations: 36m 29s, 500 more iterations: 3h 2m 28s.
[2025-11-13 09:03:42,358][__main__][INFO] - Starting iteration 170.
[2025-11-13 09:03:42,361][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1.
[2025-11-13 09:03:42,362][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:03:50,414][__main__][INFO] - Number of regex retries in iteration 170: 0 [2025-11-13 09:03:50,415][__main__][INFO] - agents played in iteration 170 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 09:03:50,893][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:03:50,926][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:03:50,960][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:03:50,994][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:03:50,994][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:03:50,995][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:03:51,724][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:03:52,019][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:03:52,347][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:03:52,671][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:03:52,994][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:03:53,319][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:03:53,642][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:03:53,966][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:03:54,289][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:03:54,613][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:03:54,938][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:03:55,263][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:03:55,585][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:03:55,908][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:03:56,236][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:03:56,563][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:03:56,887][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:03:57,210][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:03:57,534][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:03:57,857][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:03:58,180][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:03:58,504][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:03:58,828][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:03:59,154][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:03:59,478][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:03:59,805][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:04:00,133][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:04:00,459][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:04:00,786][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:04:01,109][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:04:01,438][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:04:01,762][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:04:02,089][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:04:02,781][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:04:03,498][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:04:03,500][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:04:03,502][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:04:05,393][__main__][INFO] - Iteration 171 took 23s (34.97% Gen, 56.81% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 11m 24s. Estimated total time: 19h 11m 40s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 23s, 500 more iterations: 3h 11m 56s.
[2025-11-13 09:04:05,396][__main__][INFO] - Starting iteration 171.
[2025-11-13 09:04:05,399][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 17 and human policies 1.
[2025-11-13 09:04:05,399][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:04:13,555][__main__][INFO] - Number of regex retries in iteration 171: 0
[2025-11-13 09:04:13,556][__main__][INFO] - agents played in iteration 171 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:04:14,024][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:04:14,057][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:04:14,090][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:04:14,124][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:04:14,124][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:04:14,125][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:04:14,852][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:04:15,146][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:04:15,472][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:04:15,799][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:04:16,129][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:04:16,452][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:04:16,777][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:04:17,102][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:04:17,426][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:04:17,750][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:04:18,075][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:04:18,401][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:04:18,727][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:04:19,056][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:04:19,386][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:04:19,710][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:04:20,035][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:04:20,359][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:04:20,682][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:04:21,005][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:04:21,328][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:04:21,651][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:04:21,973][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:04:22,295][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:04:22,619][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:04:22,943][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:04:23,268][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:04:23,592][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:04:23,921][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:04:24,245][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:04:24,567][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:04:24,891][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:04:25,216][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:04:25,913][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:04:26,646][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:04:26,647][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:04:26,649][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:04:27,600][__main__][INFO] - Iteration 172 took 22s (36.74% Gen, 58.97% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 29m 27s. Estimated total time: 18h 30m 5s. Time estimates for 10 more iterations: 3m 42s, 100 more iterations: 37m 0s, 500 more iterations: 3h 5m 0s.
[2025-11-13 09:04:27,602][__main__][INFO] - Starting iteration 172.
[2025-11-13 09:04:27,605][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 17 and human policies 1.
[2025-11-13 09:04:27,606][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:04:35,580][__main__][INFO] - Number of regex retries in iteration 172: 0
[2025-11-13 09:04:35,580][__main__][INFO] - agents played in iteration 172 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:04:36,047][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:04:36,081][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:04:36,115][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:04:36,149][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:04:36,150][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:04:36,150][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:04:36,885][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:04:37,178][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:04:37,503][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:04:37,829][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:04:38,152][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:04:38,474][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:04:38,796][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:04:39,119][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:04:39,442][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:04:39,767][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:04:40,089][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:04:40,413][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:04:40,741][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:04:41,066][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:04:41,392][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:04:41,722][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:04:42,047][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:04:42,372][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:04:42,697][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:04:43,022][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:04:43,346][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:04:43,672][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:04:43,996][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:04:44,319][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:04:44,641][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:04:44,964][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:04:45,293][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:04:45,616][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:04:45,944][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:04:46,268][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:04:46,594][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:04:46,919][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:04:47,245][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:04:47,964][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:04:48,684][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:04:48,686][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:04:48,687][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:04:49,667][__main__][INFO] - Iteration 173 took 22s (36.15% Gen, 59.41% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 22m 8s. Estimated total time: 18h 23m 8s. Time estimates for 10 more iterations: 3m 40s, 100 more iterations: 36m 46s, 500 more iterations: 3h 3m 51s.
[2025-11-13 09:04:49,669][__main__][INFO] - Starting iteration 173.
[2025-11-13 09:04:49,672][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 17 and human policies 1.
[2025-11-13 09:04:49,673][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:04:57,777][__main__][INFO] - Number of regex retries in iteration 173: 0
[2025-11-13 09:04:57,777][__main__][INFO] - agents played in iteration 173 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:04:58,243][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:04:58,277][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:04:58,310][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:04:58,344][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:04:58,344][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:04:58,345][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:04:59,053][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:04:59,346][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:04:59,670][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:04:59,995][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:05:00,322][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:05:00,650][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:05:00,976][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:05:01,302][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:05:01,629][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:05:01,956][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:05:02,286][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:05:02,611][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:05:02,936][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:05:03,261][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:05:03,589][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:05:03,917][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:05:04,243][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:05:04,566][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:05:04,890][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:05:05,213][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:05:05,540][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:05:05,866][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:05:06,195][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:05:06,520][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:05:06,843][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:05:07,169][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:05:07,491][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:05:07,813][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:05:08,136][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:05:08,460][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:05:08,784][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:05:09,110][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:05:09,434][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:05:10,133][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:05:10,861][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:05:10,862][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:05:10,864][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:05:11,852][__main__][INFO] - Iteration 174 took 22s (36.54% Gen, 59.00% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 27m 39s. Estimated total time: 18h 29m 2s. Time estimates for 10 more iterations: 3m 41s, 100 more iterations: 36m 58s, 500 more iterations: 3h 4m 50s.
[2025-11-13 09:05:11,855][__main__][INFO] - Starting iteration 174.
[2025-11-13 09:05:11,858][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 17 and human policies 1.
[2025-11-13 09:05:11,858][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:05:19,875][__main__][INFO] - Number of regex retries in iteration 174: 0
[2025-11-13 09:05:19,876][__main__][INFO] - agents played in iteration 174 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:05:20,333][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:05:20,367][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:05:20,401][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:05:20,434][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:05:20,435][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:05:20,435][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:05:21,148][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:05:21,441][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:05:21,770][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:05:22,094][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:05:22,419][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:05:22,742][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:05:23,065][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:05:23,390][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:05:23,716][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:05:24,044][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:05:24,367][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:05:24,691][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:05:25,014][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:05:25,338][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:05:25,661][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:05:25,990][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:05:26,316][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:05:26,645][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:05:26,969][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:05:27,292][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:05:27,616][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:05:27,940][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:05:28,265][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:05:28,587][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:05:28,916][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:05:29,240][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:05:29,567][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:05:29,895][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:05:30,219][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:05:30,545][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:05:30,867][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:05:31,190][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:05:31,512][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:05:32,219][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:05:32,969][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:05:32,971][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:05:32,972][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:05:33,944][__main__][INFO] - Iteration 175 took 22s (36.30% Gen, 59.30% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 22m 37s. Estimated total time: 18h 24m 21s. Time estimates for 10 more iterations: 3m 40s, 100 more iterations: 36m 48s, 500 more iterations: 3h 4m 3s.
[2025-11-13 09:05:33,947][__main__][INFO] - Starting iteration 175.
[2025-11-13 09:05:33,950][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 17 and human policies 1.
[2025-11-13 09:05:33,950][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:05:41,852][__main__][INFO] - Number of regex retries in iteration 175: 0
[2025-11-13 09:05:41,853][__main__][INFO] - agents played in iteration 175 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:05:42,317][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:05:42,350][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:05:42,383][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:05:42,417][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:05:42,418][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:05:42,418][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:05:43,136][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:05:43,431][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:05:43,755][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:05:44,079][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:05:44,402][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:05:44,726][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:05:45,050][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:05:45,372][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:05:45,695][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:05:46,017][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:05:46,342][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:05:46,664][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:05:46,988][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:05:47,310][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:05:47,633][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:05:47,956][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:05:48,278][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:05:48,600][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:05:48,923][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:05:49,247][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:05:49,572][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:05:49,894][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:05:50,216][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:05:50,539][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:05:50,862][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:05:51,184][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:05:51,506][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:05:51,828][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:05:52,150][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:05:52,473][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:05:52,796][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:05:53,120][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:05:53,442][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:05:54,119][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:10
[2025-11-13 09:05:54,835][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:05:54,836][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:05:54,838][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:05:56,029][__main__][INFO] - Iteration 176 took 22s (35.79% Gen, 58.81% Train). Generation: 7s, Training: 12s. Estimated remaining time: 17h 21m 53s. Estimated total time: 18h 23m 59s. Time estimates for 10 more iterations: 3m 40s, 100 more iterations: 36m 47s, 500 more iterations: 3h 3m 59s.
[2025-11-13 09:05:56,031][__main__][INFO] - Starting iteration 176.
[2025-11-13 09:05:56,034][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 17 and human policies 1.
[2025-11-13 09:05:56,035][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:06:04,343][__main__][INFO] - Number of regex retries in iteration 176: 0
[2025-11-13 09:06:04,344][__main__][INFO] - agents played in iteration 176 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:06:04,808][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:06:04,844][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:06:04,877][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:06:04,911][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:06:04,912][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:06:04,912][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:06:05,647][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:06:05,942][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:06:06,266][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:06:06,592][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:06:06,919][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:06:07,245][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:06:07,570][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:06:07,895][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:06:08,218][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:06:08,544][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:06:08,868][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:06:09,189][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:06:09,513][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:06:09,836][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:06:10,160][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:06:10,483][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:06:10,808][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:06:11,133][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:06:11,456][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:06:11,785][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:06:12,112][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:06:12,434][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:06:12,758][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:06:13,081][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:06:13,405][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:06:13,730][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:06:14,055][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:06:14,377][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:06:14,701][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:06:15,024][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:06:15,347][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:06:15,672][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:06:15,995][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:06:16,707][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:06:17,449][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:06:17,450][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:06:17,452][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:06:18,438][__main__][INFO] - Iteration 177 took 22s (37.08% Gen, 58.51% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 37m 43s. Estimated total time: 18h 40m 12s. Time estimates for 10 more iterations: 3m 44s, 100 more iterations: 37m 20s, 500 more iterations: 3h 6m 42s.
[2025-11-13 09:06:18,440][__main__][INFO] - Starting iteration 177.
[2025-11-13 09:06:18,443][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 17 and human policies 1.
[2025-11-13 09:06:18,443][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:06:26,866][__main__][INFO] - Number of regex retries in iteration 177: 0
[2025-11-13 09:06:26,867][__main__][INFO] - agents played in iteration 177 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:06:27,330][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:06:27,364][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:06:27,397][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:06:27,431][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:06:27,431][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:06:27,432][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:06:28,149][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:06:28,443][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:06:28,768][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:06:29,094][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:06:29,419][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:06:29,744][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:06:30,067][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:06:30,393][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:06:30,722][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:06:31,047][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:06:31,370][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:06:31,694][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:06:32,016][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:06:32,338][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:06:32,664][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:06:32,987][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:06:33,314][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:06:33,639][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:06:33,965][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:06:34,290][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:06:34,619][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:06:34,942][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:06:35,265][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:06:35,591][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:06:35,916][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:06:36,239][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:06:36,567][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:06:36,892][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:06:37,216][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:06:37,539][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:06:37,863][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:06:38,185][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:06:38,511][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:06:39,242][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:06:39,981][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:06:39,982][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:06:39,984][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:06:41,109][__main__][INFO] - Iteration 178 took 22s (37.16% Gen, 57.87% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 50m 29s. Estimated total time: 18h 53m 21s. Time estimates for 10 more iterations: 3m 46s, 100 more iterations: 37m 46s, 500 more iterations: 3h 8m 53s.
[2025-11-13 09:06:41,111][__main__][INFO] - Starting iteration 178.
[2025-11-13 09:06:41,114][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 17 and human policies 1.
[2025-11-13 09:06:41,115][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:06:48,112][mllm.models.large_language_model_local][WARNING] - Response |), retry 1/1
[2025-11-13 09:06:49,967][__main__][INFO] - Number of regex retries in iteration 178: 1
[2025-11-13 09:06:49,968][__main__][INFO] - agents played in iteration 178 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:06:50,445][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:06:50,480][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:06:50,514][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:06:50,548][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:06:50,548][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:06:50,548][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:06:51,281][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:06:51,576][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:06:51,902][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:06:52,227][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:06:52,552][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:06:52,875][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:06:53,197][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:06:53,522][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:06:53,850][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:06:54,172][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:06:54,496][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:06:54,820][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:06:55,147][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:06:55,469][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:06:55,798][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:06:56,124][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:06:56,448][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:06:56,773][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:06:57,096][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:06:57,418][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:06:57,742][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:06:58,066][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:06:58,389][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:06:58,713][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:06:59,037][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:06:59,362][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:06:59,690][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:07:00,016][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:07:00,341][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:07:00,668][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:07:00,995][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:07:01,321][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:07:01,647][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:07:02,378][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:07:03,109][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:07:03,111][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:07:03,113][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:07:04,108][__main__][INFO] - Iteration 179 took 22s (38.50% Gen, 57.16% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 6m 30s. Estimated total time: 19h 9m 45s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 19s, 500 more iterations: 3h 11m 37s.
[2025-11-13 09:07:04,110][__main__][INFO] - Starting iteration 179.
[2025-11-13 09:07:04,113][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 17 and human policies 1.
[2025-11-13 09:07:04,114][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:07:12,723][__main__][INFO] - Number of regex retries in iteration 179: 0
[2025-11-13 09:07:12,724][__main__][INFO] - agents played in iteration 179 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:07:13,189][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:07:13,223][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:07:13,256][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:07:13,290][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:07:13,290][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:07:13,291][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:07:14,028][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:07:14,322][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:07:14,645][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:07:14,970][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:07:15,295][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:07:15,617][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:07:15,945][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:07:16,274][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:07:16,601][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:07:16,930][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:07:17,257][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:07:17,579][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:07:17,904][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:07:18,231][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:07:18,559][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:07:18,885][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:07:19,211][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:07:19,536][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:07:19,858][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:07:20,184][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:07:20,509][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:07:20,837][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:07:21,165][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:07:21,492][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:07:21,816][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:07:22,141][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:07:22,465][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:07:22,789][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:07:23,113][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:07:23,442][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:07:23,769][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:07:24,094][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:07:24,424][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:07:25,162][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:07:25,917][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:07:25,918][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:07:25,920][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:07:26,856][__main__][INFO] - Iteration 180 took 22s (37.86% Gen, 58.02% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 53m 34s. Estimated total time: 18h 57m 11s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 54s, 500 more iterations: 3h 9m 31s.
[2025-11-13 09:07:26,858][__main__][INFO] - Starting iteration 180.
[2025-11-13 09:07:26,861][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 17 and human policies 1.
[2025-11-13 09:07:26,861][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:07:35,130][__main__][INFO] - Number of regex retries in iteration 180: 0
[2025-11-13 09:07:35,131][__main__][INFO] - agents played in iteration 180 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:07:35,623][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:07:35,660][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:07:35,694][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:07:35,728][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:07:35,729][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:07:35,729][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:07:36,463][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:07:36,758][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:07:37,082][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:07:37,408][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:07:37,731][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:07:38,053][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:07:38,378][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:07:38,701][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:07:39,029][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:07:39,356][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:07:39,683][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:07:40,012][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:07:40,335][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:07:40,657][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:07:40,983][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:07:41,309][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:07:41,633][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:07:41,956][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:07:42,280][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:07:42,603][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:07:42,926][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:07:43,249][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:07:43,574][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:07:43,899][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:07:44,223][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:07:44,546][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:07:44,869][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:07:45,193][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:07:45,517][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:07:45,840][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:07:46,162][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:07:46,485][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:07:46,811][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:07:47,522][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:07:48,243][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:07:48,245][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:07:48,247][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:07:50,163][__main__][INFO] - Iteration 181 took 23s (35.49% Gen, 56.28% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 21m 7s. Estimated total time: 19h 25m 8s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 50s, 500 more iterations: 3h 14m 11s.
[2025-11-13 09:07:50,165][__main__][INFO] - Starting iteration 181.
[2025-11-13 09:07:50,168][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 18 and human policies 1.
[2025-11-13 09:07:50,169][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:07:58,950][__main__][INFO] - Number of regex retries in iteration 181: 0 [2025-11-13 09:07:58,950][__main__][INFO] - agents played in iteration 181 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 09:07:59,414][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:07:59,450][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:07:59,484][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:07:59,518][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:07:59,519][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:07:59,519][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:08:00,243][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:08:00,537][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:08:00,864][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:08:01,186][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:08:01,513][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:08:01,838][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:08:02,163][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:08:02,485][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:08:02,807][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:08:03,130][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:08:03,454][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:08:03,778][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:08:04,103][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:08:04,426][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:08:04,751][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:08:05,075][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:08:05,400][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:08:05,724][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:08:06,047][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:08:06,370][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:08:06,694][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:08:07,019][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:08:07,344][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:08:07,667][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:08:07,996][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:08:08,321][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:08:08,649][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:08:08,974][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:08:09,305][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:08:09,628][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:08:09,953][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:08:10,277][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:08:10,603][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:08:11,320][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:08:12,047][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:08:12,049][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:08:12,050][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:08:13,069][__main__][INFO] - Iteration 182 took 22s (38.34% Gen, 57.20% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 0m 40s. Estimated total time: 19h 5m 3s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 10s, 500 more iterations: 3h 10m 50s.
[2025-11-13 09:08:13,071][__main__][INFO] - Starting iteration 182.
[2025-11-13 09:08:13,074][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 18 and human policies 1.
[2025-11-13 09:08:13,075][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:08:21,539][__main__][INFO] - Number of regex retries in iteration 182: 0
[2025-11-13 09:08:21,539][__main__][INFO] - agents played in iteration 182 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:08:22,005][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:08:22,039][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:08:22,072][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:08:22,106][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:08:22,106][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:08:22,107][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:08:22,830][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:08:23,125][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:08:23,451][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:08:23,775][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:08:24,097][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:08:24,419][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:08:24,743][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:08:25,066][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:08:25,391][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:08:25,715][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:08:26,042][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:08:26,366][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:08:26,689][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:08:27,015][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:08:27,344][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:08:27,667][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:08:27,990][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:08:28,314][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:08:28,637][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:08:28,961][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:08:29,285][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:08:29,611][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:08:29,937][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:08:30,262][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:08:30,585][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:08:30,912][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:08:31,237][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:08:31,563][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:08:31,888][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:08:32,211][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:08:32,538][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:08:32,862][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:08:33,184][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:08:33,846][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:08:34,573][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:08:34,574][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:08:34,576][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:08:35,592][__main__][INFO] - Iteration 183 took 22s (37.59% Gen, 57.89% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 41m 8s. Estimated total time: 18h 45m 55s. Time estimates for 10 more iterations: 3m 45s, 100 more iterations: 37m 31s, 500 more iterations: 3h 7m 39s.
[2025-11-13 09:08:35,594][__main__][INFO] - Starting iteration 183.
[2025-11-13 09:08:35,597][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 18 and human policies 1.
[2025-11-13 09:08:35,598][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:08:44,559][__main__][INFO] - Number of regex retries in iteration 183: 0
[2025-11-13 09:08:44,559][__main__][INFO] - agents played in iteration 183 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:08:45,033][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:08:45,066][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:08:45,100][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:08:45,133][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:08:45,134][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:08:45,134][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:08:45,852][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:08:46,147][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:08:46,470][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:08:46,793][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:08:47,115][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:08:47,438][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:08:47,762][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:08:48,084][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:08:48,409][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:08:48,732][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:08:49,055][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:08:49,379][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:08:49,703][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:08:50,027][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:08:50,350][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:08:50,674][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:08:50,999][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:08:51,326][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:08:51,648][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:08:51,977][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:08:52,300][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:08:52,624][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:08:52,948][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:08:53,275][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:08:53,603][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:08:53,928][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:08:54,255][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:08:54,579][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:08:54,902][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:08:55,226][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:08:55,551][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:08:55,874][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:08:56,196][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:08:56,881][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:08:57,605][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:08:57,606][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:08:57,608][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:08:58,587][__main__][INFO] - Iteration 184 took 22s (38.98% Gen, 56.76% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 4m 21s. Estimated total time: 19h 9m 31s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 19s, 500 more iterations: 3h 11m 35s.
[2025-11-13 09:08:58,589][__main__][INFO] - Starting iteration 184.
[2025-11-13 09:08:58,592][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 18 and human policies 1.
[2025-11-13 09:08:58,592][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:09:07,514][__main__][INFO] - Number of regex retries in iteration 184: 0
[2025-11-13 09:09:07,514][__main__][INFO] - agents played in iteration 184 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:09:07,984][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:09:08,018][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:09:08,051][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:09:08,084][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:09:08,085][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:09:08,085][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:09:08,807][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:09:09,103][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:09:09,429][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:09:09,752][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:09:10,076][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:09:10,399][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:09:10,723][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:09:11,045][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:09:11,369][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:09:11,693][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:09:12,022][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:09:12,347][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:09:12,672][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:09:12,995][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:09:13,321][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:09:13,647][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:09:13,972][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:09:14,296][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:09:14,620][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:09:14,948][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:09:15,272][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:09:15,597][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:09:15,924][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:09:16,253][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:09:16,581][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:09:16,911][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:09:17,234][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:09:17,559][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:09:17,883][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:09:18,206][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:09:18,536][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:09:18,864][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:09:19,187][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:09:19,866][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:09:20,584][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:09:20,586][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:09:20,587][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:09:21,734][__main__][INFO] - Iteration 185 took 23s (38.55% Gen, 56.49% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 11m 36s. Estimated total time: 19h 17m 9s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 34s, 500 more iterations: 3h 12m 51s.
[2025-11-13 09:09:21,736][__main__][INFO] - Starting iteration 185.
[2025-11-13 09:09:21,740][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 18 and human policies 1.
[2025-11-13 09:09:21,740][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:09:30,148][__main__][INFO] - Number of regex retries in iteration 185: 0
[2025-11-13 09:09:30,149][__main__][INFO] - agents played in iteration 185 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:09:30,622][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:09:30,657][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:09:30,690][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:09:30,724][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:09:30,724][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:09:30,725][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:09:31,452][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:09:31,747][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:09:32,072][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:09:32,395][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:09:32,720][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:09:33,044][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:09:33,367][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:09:33,690][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:09:34,017][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:09:34,341][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:09:34,670][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:09:34,993][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:09:35,318][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:09:35,644][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:09:35,970][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:09:36,293][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:09:36,616][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:09:36,939][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:09:37,265][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:09:37,588][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:09:37,914][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:09:38,238][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:09:38,562][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:09:38,888][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:09:39,213][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:09:39,536][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:09:39,861][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:09:40,186][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:09:40,508][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:09:40,832][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:09:41,155][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:09:41,481][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:09:41,805][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:09:42,514][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:09:43,250][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:09:43,251][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:09:43,253][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:09:44,236][__main__][INFO] - Iteration 186 took 22s (37.38% Gen, 58.25% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 38m 56s. Estimated total time: 18h 44m 51s. Time estimates for 10 more iterations: 3m 44s, 100 more iterations: 37m 29s, 500 more iterations: 3h 7m 28s.
[2025-11-13 09:09:44,238][__main__][INFO] - Starting iteration 186.
[2025-11-13 09:09:44,242][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 18 and human policies 1.
[2025-11-13 09:09:44,242][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:09:53,322][__main__][INFO] - Number of regex retries in iteration 186: 0
[2025-11-13 09:09:53,323][__main__][INFO] - agents played in iteration 186 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:09:53,803][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:09:53,838][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:09:53,872][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:09:53,906][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:09:53,907][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:09:53,907][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:09:54,647][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:09:54,943][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:09:55,269][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:09:55,593][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:09:55,916][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:09:56,239][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:09:56,565][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:09:56,890][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:09:57,218][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:09:57,545][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:09:57,868][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:09:58,191][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:09:58,519][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:09:58,840][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:09:59,163][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:09:59,487][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:09:59,811][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:10:00,134][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:10:00,458][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:10:00,787][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:10:01,112][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:10:01,437][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:10:01,759][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:10:02,083][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:10:02,410][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:10:02,733][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:10:03,057][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:10:03,384][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:10:03,710][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:10:04,038][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:10:04,361][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:10:04,690][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:10:05,015][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:10:05,727][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:10:06,470][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:10:06,472][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:10:06,473][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:10:07,472][__main__][INFO] - Iteration 187 took 23s (39.09% Gen, 56.61% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 15m 15s. Estimated total time: 19h 21m 33s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 43s, 500 more iterations: 3h 13m 35s. [2025-11-13 09:10:07,474][__main__][INFO] - Starting iteration 187. [2025-11-13 09:10:07,478][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 18 and human policies 1. 
[2025-11-13 09:10:07,478][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:10:16,494][__main__][INFO] - Number of regex retries in iteration 187: 0 [2025-11-13 09:10:16,495][__main__][INFO] - agents played in iteration 187 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 09:10:16,960][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:10:16,997][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:10:17,031][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:10:17,064][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:10:17,065][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:10:17,065][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:10:17,798][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:10:18,092][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:10:18,414][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:10:18,738][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:10:19,062][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:10:19,385][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:10:19,708][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:10:20,033][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:10:20,358][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:10:20,682][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:10:21,006][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:10:21,330][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:10:21,653][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:10:21,976][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:10:22,301][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:10:22,622][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:10:22,947][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:10:23,270][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:10:23,594][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:10:23,917][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:10:24,241][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:10:24,566][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:10:24,889][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:10:25,213][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:10:25,536][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:10:25,861][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:10:26,185][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:10:26,511][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:10:26,835][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:10:27,157][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:10:27,482][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:10:27,806][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:10:28,130][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:10:28,869][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:10:29,638][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:10:29,639][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:10:29,641][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:10:30,606][__main__][INFO] - Iteration 188 took 23s (38.98% Gen, 56.84% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 9m 45s. Estimated total time: 19h 16m 26s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 32s, 500 more iterations: 3h 12m 44s. [2025-11-13 09:10:30,608][__main__][INFO] - Starting iteration 188. [2025-11-13 09:10:30,611][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 18 and human policies 1. 
[2025-11-13 09:10:30,611][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:10:39,611][__main__][INFO] - Number of regex retries in iteration 188: 0 [2025-11-13 09:10:39,612][__main__][INFO] - agents played in iteration 188 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 09:10:40,085][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:10:40,118][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:10:40,151][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:10:40,184][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:10:40,184][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:10:40,185][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:10:40,900][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:10:41,194][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:10:41,519][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:10:41,845][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:10:42,168][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:10:42,491][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:10:42,814][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:10:43,137][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:10:43,461][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:10:43,784][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:10:44,108][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:10:44,432][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:10:44,755][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:10:45,079][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:10:45,402][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:10:45,726][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:10:46,050][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:10:46,373][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:10:46,695][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:10:47,020][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:10:47,343][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:10:47,667][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:10:47,990][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:10:48,314][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:10:48,637][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:10:48,960][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:10:49,287][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:10:49,609][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:10:49,934][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:10:50,258][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:10:50,582][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:10:50,907][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:10:51,232][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:10:51,953][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:10:52,685][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:10:52,686][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:10:52,688][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:10:53,744][__main__][INFO] - Iteration 189 took 23s (38.91% Gen, 56.52% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 9m 37s. Estimated total time: 19h 16m 42s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 33s, 500 more iterations: 3h 12m 47s. [2025-11-13 09:10:53,746][__main__][INFO] - Starting iteration 189. [2025-11-13 09:10:53,749][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 18 and human policies 1. 
[2025-11-13 09:10:53,749][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:11:02,523][__main__][INFO] - Number of regex retries in iteration 189: 0 [2025-11-13 09:11:02,524][__main__][INFO] - agents played in iteration 189 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 09:11:02,995][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:11:03,030][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:11:03,063][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:11:03,096][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:11:03,097][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:11:03,097][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:11:03,816][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:11:04,110][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:11:04,433][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:11:04,757][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:11:05,079][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:11:05,403][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:11:05,726][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:11:06,050][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:11:06,376][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:11:06,702][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:11:07,028][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:11:07,353][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:11:07,679][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:11:08,003][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:11:08,327][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:11:08,650][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:11:08,973][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:11:09,298][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:11:09,620][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:11:09,944][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:11:10,267][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:11:10,591][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:11:10,914][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:11:11,237][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:11:11,562][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:11:11,885][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:11:12,209][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:11:12,532][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:11:12,855][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:11:13,178][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:11:13,504][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:11:13,829][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:11:14,151][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:11:14,867][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:11:15,617][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:11:15,619][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:11:15,620][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:11:16,600][__main__][INFO] - Iteration 190 took 22s (38.39% Gen, 57.31% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 55m 8s. Estimated total time: 19h 2m 36s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 5s, 500 more iterations: 3h 10m 26s. [2025-11-13 09:11:16,602][__main__][INFO] - Starting iteration 190. [2025-11-13 09:11:16,605][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 18 and human policies 1. 
[2025-11-13 09:11:16,606][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:11:25,642][__main__][INFO] - Number of regex retries in iteration 190: 0 [2025-11-13 09:11:25,643][__main__][INFO] - agents played in iteration 190 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 09:11:26,121][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:11:26,154][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:11:26,188][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:11:26,222][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:11:26,223][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:11:26,223][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:11:26,947][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:11:27,242][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:11:27,566][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:11:27,888][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:11:28,212][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:11:28,535][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:11:28,860][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:11:29,183][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:11:29,508][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:11:29,830][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:11:30,154][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:11:30,478][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:11:30,804][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:11:31,127][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:11:31,450][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:11:31,772][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:11:32,099][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:11:32,422][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:11:32,747][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:11:33,070][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:11:33,394][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:11:33,716][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:11:34,039][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:11:34,363][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:11:34,691][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:11:35,014][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:11:35,337][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:11:35,660][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:11:35,984][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:11:36,309][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:11:36,635][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:11:36,958][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:11:37,282][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:11:37,992][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:11:38,737][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:11:38,739][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:11:38,740][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:11:40,733][__main__][INFO] - Iteration 191 took 24s (37.45% Gen, 54.28% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 58m 33s. Estimated total time: 20h 6m 25s. Time estimates for 10 more iterations: 4m 1s, 100 more iterations: 40m 12s, 500 more iterations: 3h 21m 4s. [2025-11-13 09:11:40,735][__main__][INFO] - Starting iteration 191. [2025-11-13 09:11:40,738][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 19 and human policies 1. 
[2025-11-13 09:11:40,738][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:11:49,780][__main__][INFO] - Number of regex retries in iteration 191: 0 [2025-11-13 09:11:49,781][__main__][INFO] - agents played in iteration 191 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 09:11:50,255][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:11:50,289][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:11:50,324][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:11:50,358][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:11:50,358][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:11:50,358][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:11:51,054][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:11:51,348][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:11:51,672][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:11:51,996][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:11:52,323][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:11:52,647][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:11:52,975][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:11:53,304][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:11:53,631][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:11:53,953][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:11:54,281][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:11:54,605][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:11:54,931][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:11:55,259][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:11:55,583][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:11:55,909][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:11:56,233][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:11:56,562][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:11:56,886][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:11:57,213][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:11:57,542][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:11:57,869][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:11:58,194][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:11:58,520][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:11:58,845][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:11:59,169][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:11:59,492][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:11:59,819][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:12:00,145][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:12:00,470][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:12:00,796][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:12:01,122][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:12:01,451][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:12:02,163][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:12:02,913][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:12:02,914][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:12:02,916][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:12:03,844][__main__][INFO] - Iteration 192 took 23s (39.13% Gen, 56.84% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 7m 6s. Estimated total time: 19h 15m 21s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 30s, 500 more iterations: 3h 12m 33s. [2025-11-13 09:12:03,846][__main__][INFO] - Starting iteration 192. [2025-11-13 09:12:03,849][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 19 and human policies 1. 
[2025-11-13 09:12:03,849][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:12:13,002][__main__][INFO] - Number of regex retries in iteration 192: 0 [2025-11-13 09:12:13,003][__main__][INFO] - agents played in iteration 192 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 09:12:13,485][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:12:13,518][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:12:13,552][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:12:13,585][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:12:13,586][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:12:13,586][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:12:14,299][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:12:14,593][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:12:14,919][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:12:15,241][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:12:15,564][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:12:15,892][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:12:16,214][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:12:16,538][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:12:16,862][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:12:17,184][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:12:17,506][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:12:17,829][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:12:18,151][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:12:18,474][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:12:18,799][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:12:19,123][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:12:19,445][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:12:19,769][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:12:20,093][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:12:20,418][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:12:20,740][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:12:21,065][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:12:21,388][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:12:21,711][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:12:22,035][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:12:22,358][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:12:22,681][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:12:23,005][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:12:23,332][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:12:23,656][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:12:23,979][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:12:24,304][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:12:24,630][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:12:25,355][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:12:26,100][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:12:26,101][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:12:26,103][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:12:27,014][__main__][INFO] - Iteration 193 took 23s (39.51% Gen, 56.55% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 9m 40s. Estimated total time: 19h 18m 18s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 36s, 500 more iterations: 3h 13m 3s. [2025-11-13 09:12:27,016][__main__][INFO] - Starting iteration 193. [2025-11-13 09:12:27,019][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 19 and human policies 1. 
[2025-11-13 09:12:27,019][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:12:35,563][__main__][INFO] - Number of regex retries in iteration 193: 0 [2025-11-13 09:12:35,564][__main__][INFO] - agents played in iteration 193 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 09:12:36,032][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:12:36,068][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:12:36,102][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:12:36,135][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:12:36,136][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:12:36,136][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:12:36,862][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:12:37,155][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:12:37,480][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:12:37,802][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:12:38,126][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:12:38,450][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:12:38,773][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:12:39,096][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:12:39,420][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:12:39,742][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:12:40,066][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:12:40,390][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:12:40,714][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:12:41,039][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:12:41,365][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:12:41,690][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:12:42,014][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:12:42,339][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:12:42,663][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:12:42,986][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:12:43,310][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:12:43,633][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:12:43,956][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:12:44,280][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:12:44,604][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:12:44,928][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:12:45,251][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:12:45,575][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:12:45,898][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:12:46,223][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:12:46,546][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:12:46,870][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:12:47,195][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:12:47,903][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:12:48,631][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:12:48,632][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:12:48,634][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:12:49,604][__main__][INFO] - Iteration 194 took 22s (37.83% Gen, 57.87% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 40m 18s. Estimated total time: 18h 49m 18s. Time estimates for 10 more iterations: 3m 45s, 100 more iterations: 37m 38s, 500 more iterations: 3h 8m 13s. [2025-11-13 09:12:49,607][__main__][INFO] - Starting iteration 194. [2025-11-13 09:12:49,610][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 19 and human policies 1. 
[2025-11-13 09:12:49,611][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:12:58,833][__main__][INFO] - Number of regex retries in iteration 194: 0 [2025-11-13 09:12:58,834][__main__][INFO] - agents played in iteration 194 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 09:12:59,299][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:12:59,332][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:12:59,364][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:12:59,397][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:12:59,398][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:12:59,398][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:13:00,119][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:13:00,412][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:13:00,736][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:13:01,060][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:13:01,384][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:13:01,708][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:13:02,032][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:13:02,355][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:13:02,684][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:13:03,008][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:13:03,330][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:13:03,653][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:13:03,976][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:13:04,301][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:13:04,625][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:13:04,947][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:13:05,272][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:13:05,599][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:13:05,921][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:13:06,244][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:13:06,568][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:13:06,892][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:13:07,215][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:13:07,538][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:13:07,860][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:13:08,185][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:13:08,508][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:13:08,832][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:13:09,156][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:13:09,479][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:13:09,803][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:13:10,129][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:13:10,454][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:13:11,175][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:13:11,920][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:13:11,922][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:13:11,923][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:13:12,995][__main__][INFO] - Iteration 195 took 23s (39.44% Gen, 55.97% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 19m 55s. Estimated total time: 19h 29m 18s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 58s, 500 more iterations: 3h 14m 53s. [2025-11-13 09:13:12,997][__main__][INFO] - Starting iteration 195. [2025-11-13 09:13:13,000][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 19 and human policies 1. 
[2025-11-13 09:13:13,000][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:13:22,203][__main__][INFO] - Number of regex retries in iteration 195: 0 [2025-11-13 09:13:22,203][__main__][INFO] - agents played in iteration 195 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 09:13:22,665][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:13:22,699][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:13:22,732][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:13:22,767][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:13:22,767][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:13:22,767][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:13:23,475][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:13:23,768][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:13:24,091][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:13:24,416][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:13:24,739][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:13:25,062][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:13:25,391][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:13:25,719][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:13:26,045][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:13:26,368][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:13:26,691][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:13:27,015][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:13:27,340][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:13:27,668][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:13:28,003][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:13:28,328][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:13:28,655][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:13:28,982][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:13:29,308][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:13:29,631][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:13:29,954][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:13:30,279][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:13:30,603][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:13:30,928][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:13:31,250][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:13:31,575][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:13:31,898][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:13:32,225][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:13:32,550][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:13:32,872][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:13:33,197][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:13:33,521][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:13:33,846][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:13:34,561][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:13:35,314][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:13:35,315][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:13:35,317][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:13:36,237][__main__][INFO] - Iteration 196 took 23s (39.60% Gen, 56.43% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 12m 6s. Estimated total time: 19h 21m 54s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 43s, 500 more iterations: 3h 13m 39s. [2025-11-13 09:13:36,239][__main__][INFO] - Starting iteration 196. [2025-11-13 09:13:36,242][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 19 and human policies 1. 
[2025-11-13 09:13:36,242][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:13:45,420][__main__][INFO] - Number of regex retries in iteration 196: 0 [2025-11-13 09:13:45,421][__main__][INFO] - agents played in iteration 196 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 09:13:45,884][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:13:45,920][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:13:45,954][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:13:45,987][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:13:45,988][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:13:45,988][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:13:46,688][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:13:46,980][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:13:47,308][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:13:47,634][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:13:47,962][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:13:48,294][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:13:48,615][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:13:48,940][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:13:49,267][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:13:49,590][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:13:49,915][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:13:50,241][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:13:50,565][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:13:50,889][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:13:51,214][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:13:51,541][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:13:51,867][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:13:52,191][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:13:52,515][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:13:52,839][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:13:53,163][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:13:53,486][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:13:53,811][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:13:54,134][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:13:54,465][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:13:54,793][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:13:55,122][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:13:55,448][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:13:55,770][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:13:56,093][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:13:56,419][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:13:56,743][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:13:57,067][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:13:57,795][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:13:58,526][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:13:58,528][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:13:58,529][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:13:59,430][__main__][INFO] - Iteration 197 took 23s (39.58% Gen, 56.53% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 9m 16s. Estimated total time: 19h 19m 26s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 38s, 500 more iterations: 3h 13m 14s. [2025-11-13 09:13:59,432][__main__][INFO] - Starting iteration 197. [2025-11-13 09:13:59,434][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 19 and human policies 1. 
[2025-11-13 09:13:59,435][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:14:08,136][__main__][INFO] - Number of regex retries in iteration 197: 0 [2025-11-13 09:14:08,136][__main__][INFO] - agents played in iteration 197 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 09:14:08,601][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:14:08,634][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:14:08,667][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:14:08,701][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:14:08,702][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:14:08,702][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:14:09,415][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:14:09,709][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:14:10,037][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:14:10,362][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:14:10,686][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:14:11,011][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:14:11,336][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:14:11,662][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:14:11,989][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:14:12,312][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:14:12,638][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:14:12,964][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:14:13,288][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:14:13,614][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:14:13,936][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:14:14,265][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:14:14,590][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:14:14,916][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:14:15,242][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:14:15,569][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:14:15,899][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:14:16,225][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:14:16,552][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:14:16,875][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:14:17,200][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:14:17,530][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:14:17,861][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:14:18,189][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:14:18,515][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:14:18,840][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:14:19,165][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:14:19,490][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:14:19,820][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:14:20,541][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:14:21,291][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:14:21,292][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:14:21,294][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:14:22,325][__main__][INFO] - Iteration 198 took 22s (38.01% Gen, 57.48% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 54m 2s. Estimated total time: 19h 4m 35s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 9s, 500 more iterations: 3h 10m 45s.
[2025-11-13 09:14:22,327][__main__][INFO] - Starting iteration 198.
[2025-11-13 09:14:22,330][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 19 and human policies 1.
[2025-11-13 09:14:22,331][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:14:31,471][__main__][INFO] - Number of regex retries in iteration 198: 0
[2025-11-13 09:14:31,472][__main__][INFO] - agents played in iteration 198 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:14:31,938][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:14:31,973][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:14:32,007][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:14:32,041][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:14:32,042][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:14:32,042][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:14:32,766][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:14:33,059][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:14:33,383][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:14:33,710][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:14:34,035][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:14:34,359][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:14:34,685][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:14:35,011][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:14:35,338][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:14:35,666][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:14:35,991][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:14:36,320][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:14:36,649][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:14:36,977][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:14:37,304][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:14:37,627][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:14:37,950][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:14:38,274][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:14:38,597][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:14:38,926][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:14:39,253][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:14:39,579][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:14:39,904][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:14:40,229][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:14:40,554][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:14:40,877][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:14:41,201][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:14:41,528][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:14:41,851][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:14:42,175][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:14:42,498][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:14:42,823][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:14:43,149][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:14:43,876][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:14:44,633][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:14:44,635][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:14:44,636][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:14:45,597][__main__][INFO] - Iteration 199 took 23s (39.29% Gen, 56.58% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 12m 25s. Estimated total time: 19h 23m 22s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 46s, 500 more iterations: 3h 13m 53s.
[2025-11-13 09:14:45,599][__main__][INFO] - Starting iteration 199.
[2025-11-13 09:14:45,601][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 19 and human policies 1.
[2025-11-13 09:14:45,602][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:14:54,537][__main__][INFO] - Number of regex retries in iteration 199: 0
[2025-11-13 09:14:54,538][__main__][INFO] - agents played in iteration 199 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:14:55,013][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:14:55,050][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:14:55,084][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:14:55,117][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:14:55,118][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:14:55,118][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:14:55,842][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:14:56,135][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:14:56,459][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:14:56,781][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:14:57,105][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:14:57,429][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:14:57,752][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:14:58,077][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:14:58,403][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:14:58,727][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:14:59,055][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:14:59,380][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:14:59,705][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:15:00,027][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:15:00,350][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:15:00,675][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:15:00,997][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:15:01,321][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:15:01,644][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:15:01,968][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:15:02,292][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:15:02,615][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:15:02,940][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:15:03,263][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:15:03,588][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:15:03,911][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:15:04,235][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:15:04,558][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:15:04,884][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:15:05,209][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:15:05,535][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:15:05,859][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:15:06,182][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:15:06,901][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:15:07,653][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:15:07,655][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:15:07,656][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:15:08,656][__main__][INFO] - Iteration 200 took 23s (38.76% Gen, 56.90% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 1m 27s. Estimated total time: 19h 12m 46s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 25s, 500 more iterations: 3h 12m 7s.
[2025-11-13 09:15:08,658][__main__][INFO] - Starting iteration 200.
[2025-11-13 09:15:08,661][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 19 and human policies 1.
[2025-11-13 09:15:08,661][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:15:17,515][__main__][INFO] - Number of regex retries in iteration 200: 0
[2025-11-13 09:15:17,516][__main__][INFO] - agents played in iteration 200 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:15:17,981][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:15:18,014][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:15:18,047][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:15:18,080][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:15:18,081][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:15:18,081][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:15:18,815][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:15:19,111][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:15:19,436][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:15:19,764][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:15:20,086][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:15:20,412][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:15:20,734][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:15:21,058][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:15:21,383][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:15:21,707][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:15:22,029][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:15:22,355][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:15:22,679][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:15:23,003][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:15:23,328][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:15:23,651][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:15:23,974][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:15:24,296][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:15:24,622][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:15:24,948][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:15:25,270][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:15:25,594][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:15:25,919][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:15:26,242][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:15:26,566][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:15:26,891][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:15:27,217][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:15:27,543][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:15:27,868][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:15:28,191][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:15:28,516][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:15:28,845][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:15:29,168][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:15:29,882][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:15:30,617][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:15:30,618][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:15:30,620][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:15:32,544][__main__][INFO] - Iteration 201 took 23s (37.07% Gen, 54.87% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 42m 29s. Estimated total time: 19h 54m 12s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 48s, 500 more iterations: 3h 19m 2s.
[2025-11-13 09:15:32,546][__main__][INFO] - Starting iteration 201.
[2025-11-13 09:15:32,550][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 20 and human policies 1.
[2025-11-13 09:15:32,551][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:15:41,898][__main__][INFO] - Number of regex retries in iteration 201: 0
[2025-11-13 09:15:41,899][__main__][INFO] - agents played in iteration 201 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:15:42,373][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:15:42,407][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:15:42,441][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:15:42,475][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:15:42,475][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:15:42,476][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:15:43,203][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:15:43,497][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:15:43,820][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:15:44,145][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:15:44,467][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:15:44,790][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:15:45,117][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:15:45,440][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:15:45,763][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:15:46,086][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:15:46,413][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:15:46,736][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:15:47,060][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:15:47,384][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:15:47,709][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:15:48,032][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:15:48,356][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:15:48,680][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:15:49,004][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:15:49,329][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:15:49,654][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:15:49,984][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:15:50,308][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:15:50,635][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:15:50,962][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:15:51,290][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:15:51,614][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:15:51,936][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:15:52,262][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:15:52,585][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:15:52,908][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:15:53,230][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:15:53,554][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:15:54,271][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:15:55,018][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:15:55,019][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:15:55,021][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:15:56,015][__main__][INFO] - Iteration 202 took 23s (39.83% Gen, 55.92% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 21m 11s. Estimated total time: 19h 33m 18s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 6s, 500 more iterations: 3h 15m 33s.
[2025-11-13 09:15:56,017][__main__][INFO] - Starting iteration 202.
[2025-11-13 09:15:56,020][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 20 and human policies 1.
[2025-11-13 09:15:56,021][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:16:05,218][__main__][INFO] - Number of regex retries in iteration 202: 0
[2025-11-13 09:16:05,219][__main__][INFO] - agents played in iteration 202 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:16:05,684][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:16:05,720][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:16:05,754][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:16:05,788][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:16:05,789][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:16:05,789][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:16:06,498][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:16:06,794][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:16:07,117][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:16:07,444][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:16:07,772][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:16:08,096][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:16:08,422][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:16:08,745][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:16:09,068][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:16:09,391][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:16:09,713][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:16:10,036][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:16:10,359][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:16:10,683][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:16:11,007][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:16:11,331][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:16:11,655][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:16:11,978][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:16:12,303][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:16:12,627][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:16:12,952][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:16:13,275][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:16:13,599][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:16:13,924][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:16:14,248][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:16:14,574][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:16:14,898][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:16:15,223][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:16:15,546][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:16:15,870][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:16:16,193][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:16:16,517][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:16:16,841][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:16:17,535][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:16:18,267][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:16:18,268][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:16:18,270][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:16:19,279][__main__][INFO] - Iteration 203 took 23s (39.55% Gen, 56.11% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 10m 28s. Estimated total time: 19h 22m 58s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 45s, 500 more iterations: 3h 13m 49s.
[2025-11-13 09:16:19,281][__main__][INFO] - Starting iteration 203.
[2025-11-13 09:16:19,284][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 20 and human policies 1.
[2025-11-13 09:16:19,285][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:16:28,732][__main__][INFO] - Number of regex retries in iteration 203: 0
[2025-11-13 09:16:28,733][__main__][INFO] - agents played in iteration 203 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:16:29,214][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:16:29,249][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:16:29,283][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:16:29,316][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:16:29,317][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:16:29,317][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:16:30,051][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:16:30,346][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:16:30,669][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:16:30,992][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:16:31,316][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:16:31,640][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:16:31,964][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:16:32,286][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:16:32,608][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:16:32,933][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:16:33,256][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:16:33,585][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:16:33,909][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:16:34,235][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:16:34,557][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:16:34,881][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:16:35,205][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:16:35,529][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:16:35,853][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:16:36,175][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:16:36,500][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:16:36,824][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:16:37,146][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:16:37,470][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:16:37,793][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:16:38,117][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:16:38,440][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:16:38,766][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:16:39,089][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:16:39,411][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:16:39,733][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:16:40,056][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:16:40,381][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:16:41,100][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:16:41,843][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:16:41,844][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:16:41,848][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:16:42,832][__main__][INFO] - Iteration 204 took 23s (40.12% Gen, 55.69% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 24m 32s. Estimated total time: 19h 37m 26s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 14s, 500 more iterations: 3h 16m 14s.
[2025-11-13 09:16:42,834][__main__][INFO] - Starting iteration 204.
[2025-11-13 09:16:42,837][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 20 and human policies 1.
[2025-11-13 09:16:42,838][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:16:52,322][__main__][INFO] - Number of regex retries in iteration 204: 0
[2025-11-13 09:16:52,323][__main__][INFO] - agents played in iteration 204 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:16:52,802][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:16:52,835][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:16:52,868][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:16:52,902][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:16:52,902][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:16:52,903][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:16:53,633][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:16:53,929][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:16:54,255][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:16:54,582][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:16:54,908][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:16:55,233][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:16:55,558][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:16:55,883][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:16:56,213][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:16:56,540][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:16:56,866][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:16:57,192][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:16:57,515][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:16:57,841][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:16:58,168][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:16:58,496][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:16:58,823][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:16:59,151][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:16:59,477][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:16:59,804][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:17:00,130][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:17:00,455][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:17:00,781][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:17:01,106][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:17:01,436][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:17:01,765][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:17:02,093][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:17:02,422][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:17:02,749][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:17:03,075][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:17:03,403][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:17:03,729][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:17:04,054][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:17:04,788][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:17:05,544][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:17:05,545][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:17:05,547][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:17:06,545][__main__][INFO] - Iteration 205 took 23s (40.01% Gen, 55.78% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 32m 8s. Estimated total time: 19h 45m 26s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 30s, 500 more iterations: 3h 17m 34s.
[2025-11-13 09:17:06,548][__main__][INFO] - Starting iteration 205.
[2025-11-13 09:17:06,551][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 20 and human policies 1.
[2025-11-13 09:17:06,552][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:17:16,026][__main__][INFO] - Number of regex retries in iteration 205: 0
[2025-11-13 09:17:16,026][__main__][INFO] - agents played in iteration 205 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:17:16,494][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:17:16,527][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:17:16,560][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:17:16,594][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:17:16,594][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:17:16,595][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:17:17,332][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:17:17,626][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:17:17,949][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:17:18,273][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:17:18,597][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:17:18,921][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:17:19,249][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:17:19,573][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:17:19,896][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:17:20,222][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:17:20,548][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:17:20,870][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:17:21,194][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:17:21,517][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:17:21,841][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:17:22,166][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:17:22,489][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:17:22,813][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:17:23,135][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:17:23,460][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:17:23,784][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:17:24,109][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:17:24,431][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:17:24,754][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:17:25,079][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:17:25,404][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:17:25,728][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:17:26,052][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:17:26,377][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:17:26,699][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:17:27,023][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:17:27,347][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:17:27,673][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:17:28,404][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:17:29,155][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:17:29,157][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:17:29,158][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:17:30,161][__main__][INFO] - Iteration 206 took 23s (40.13% Gen, 55.62% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 26m 50s. Estimated total time: 19h 40m 31s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 21s, 500 more iterations: 3h 16m 45s.
[2025-11-13 09:17:30,163][__main__][INFO] - Starting iteration 206.
[2025-11-13 09:17:30,167][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 20 and human policies 1.
[2025-11-13 09:17:30,168][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:17:39,630][__main__][INFO] - Number of regex retries in iteration 206: 0
[2025-11-13 09:17:39,631][__main__][INFO] - agents played in iteration 206 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:17:40,108][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:17:40,144][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:17:40,178][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:17:40,212][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:17:40,212][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:17:40,213][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:17:40,947][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:17:41,244][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:17:41,571][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:17:41,895][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:17:42,219][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:17:42,542][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:17:42,866][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:17:43,189][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:17:43,512][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:17:43,834][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:17:44,159][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:17:44,482][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:17:44,805][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:17:45,129][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:17:45,453][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:17:45,776][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:17:46,100][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:17:46,429][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:17:46,755][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:17:47,078][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:17:47,404][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:17:47,727][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:17:48,052][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:17:48,373][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:17:48,697][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:17:49,021][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:17:49,346][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:17:49,669][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:17:49,992][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:17:50,314][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:17:50,638][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:17:50,964][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:17:51,285][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:17:52,011][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:17:52,767][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:17:52,768][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:17:52,770][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:17:53,807][__main__][INFO] - Iteration 207 took 23s (40.03% Gen, 55.58% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 27m 56s. Estimated total time: 19h 42m 1s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 24s, 500 more iterations: 3h 17m 0s.
[2025-11-13 09:17:53,810][__main__][INFO] - Starting iteration 207.
[2025-11-13 09:17:53,813][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 20 and human policies 1.
[2025-11-13 09:17:53,814][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:18:03,103][__main__][INFO] - Number of regex retries in iteration 207: 0
[2025-11-13 09:18:03,104][__main__][INFO] - agents played in iteration 207 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:18:03,581][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:18:03,617][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:18:03,651][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:18:03,685][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:18:03,686][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:18:03,686][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:18:04,412][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:18:04,706][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:18:05,031][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:18:05,353][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:18:05,679][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:18:06,005][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:18:06,329][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:18:06,656][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:18:06,982][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:18:07,307][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:18:07,632][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:18:07,955][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:18:08,280][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:18:08,606][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:18:08,934][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:18:09,260][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:18:09,585][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:18:09,910][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:18:10,238][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:18:10,562][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:18:10,890][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:18:11,213][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:18:11,538][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:18:11,862][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:18:12,191][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:18:12,514][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:18:12,837][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:18:13,160][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:18:13,482][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:18:13,811][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:18:14,136][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:18:14,462][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:18:14,789][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:18:15,502][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:18:16,234][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:18:16,236][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:18:16,237][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:18:17,185][__main__][INFO] - Iteration 208 took 23s (39.75% Gen, 56.19% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 14m 10s. Estimated total time: 19h 28m 38s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 57s, 500 more iterations: 3h 14m 46s.
[2025-11-13 09:18:17,187][__main__][INFO] - Starting iteration 208.
[2025-11-13 09:18:17,190][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 20 and human policies 1.
[2025-11-13 09:18:17,191][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:18:26,782][__main__][INFO] - Number of regex retries in iteration 208: 0
[2025-11-13 09:18:26,783][__main__][INFO] - agents played in iteration 208 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:18:27,247][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:18:27,281][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:18:27,315][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:18:27,349][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:18:27,350][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:18:27,350][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:18:28,069][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:18:28,362][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:18:28,687][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:18:29,011][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:18:29,335][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:18:29,659][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:18:29,983][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:18:30,306][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:18:30,631][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:18:30,954][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:18:31,276][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:18:31,600][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:18:31,925][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:18:32,248][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:18:32,571][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:18:32,894][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:18:33,217][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:18:33,542][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:18:33,867][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:18:34,191][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:18:34,515][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:18:34,837][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:18:35,159][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:18:35,483][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:18:35,807][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:18:36,130][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:18:36,454][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:18:36,781][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:18:37,104][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:18:37,429][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:18:37,754][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:18:38,078][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:18:38,404][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:18:39,137][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:18:39,886][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:18:39,887][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:18:39,889][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:18:40,853][__main__][INFO] - Iteration 209 took 23s (40.53% Gen, 55.39% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 28m 19s. Estimated total time: 19h 43m 11s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 26s, 500 more iterations: 3h 17m 11s.
[2025-11-13 09:18:40,855][__main__][INFO] - Starting iteration 209.
[2025-11-13 09:18:40,858][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 20 and human policies 1.
[2025-11-13 09:18:40,858][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:18:50,310][__main__][INFO] - Number of regex retries in iteration 209: 0
[2025-11-13 09:18:50,310][__main__][INFO] - agents played in iteration 209 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:18:50,772][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:18:50,806][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:18:50,839][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:18:50,873][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:18:50,874][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:18:50,874][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:18:51,581][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:18:51,875][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:18:52,201][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:18:52,527][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:18:52,851][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:18:53,177][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:18:53,503][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:18:53,825][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:18:54,150][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:18:54,475][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:18:54,799][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:18:55,123][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:18:55,448][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:18:55,771][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:18:56,094][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:18:56,419][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:18:56,745][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:18:57,071][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:18:57,394][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:18:57,717][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:18:58,043][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:18:58,365][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:18:58,689][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:18:59,013][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:18:59,336][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:18:59,661][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:18:59,983][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:19:00,307][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:19:00,631][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:19:00,955][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:19:01,279][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:19:01,605][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:19:01,930][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:19:02,647][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:19:03,384][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:19:03,386][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:19:03,387][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:19:04,370][__main__][INFO] - Iteration 210 took 23s (40.20% Gen, 55.62% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 20m 24s. Estimated total time: 19h 35m 39s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 11s, 500 more iterations: 3h 15m 56s.
[2025-11-13 09:19:04,372][__main__][INFO] - Starting iteration 210.
[2025-11-13 09:19:04,374][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 20 and human policies 1.
[2025-11-13 09:19:04,375][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:19:13,561][__main__][INFO] - Number of regex retries in iteration 210: 0
[2025-11-13 09:19:13,562][__main__][INFO] - agents played in iteration 210 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:19:14,027][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:19:14,062][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:19:14,096][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:19:14,129][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:19:14,130][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:19:14,130][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:19:14,850][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:19:15,143][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:19:15,467][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:19:15,790][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:19:16,115][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:19:16,437][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:19:16,761][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:19:17,085][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:19:17,409][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:19:17,732][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:19:18,054][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:19:18,375][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:19:18,699][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:19:19,023][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:19:19,345][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:19:19,670][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:19:19,992][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:19:20,316][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:19:20,638][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:19:20,960][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:19:21,284][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:19:21,611][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:19:21,932][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:19:22,256][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:19:22,580][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:19:22,911][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:19:23,236][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:19:23,564][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:19:23,893][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:19:24,218][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:19:24,544][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:19:24,871][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:19:25,198][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:19:25,915][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:19:26,648][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:19:26,649][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:19:26,651][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:19:28,650][__main__][INFO] - Iteration 211 took 24s (37.84% Gen, 53.92% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 58m 10s. Estimated total time: 20h 13m 50s. Time estimates for 10 more iterations: 4m 2s, 100 more iterations: 40m 27s, 500 more iterations: 3h 22m 18s.
[2025-11-13 09:19:28,652][__main__][INFO] - Starting iteration 211.
[2025-11-13 09:19:28,655][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 21 and human policies 1.
[2025-11-13 09:19:28,655][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:19:37,934][__main__][INFO] - Number of regex retries in iteration 211: 0
[2025-11-13 09:19:37,935][__main__][INFO] - agents played in iteration 211 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:19:38,405][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:19:38,442][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:19:38,476][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:19:38,863][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:19:38,864][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:19:38,864][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:19:39,584][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:19:39,879][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:19:40,203][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:19:40,526][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:19:40,850][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:19:41,175][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:19:41,498][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:19:41,824][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:19:42,150][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:19:42,472][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:19:42,795][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:19:43,117][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:19:43,440][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:19:43,765][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:19:44,090][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:19:44,415][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:19:44,743][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:19:45,072][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:19:45,397][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:19:45,721][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:19:46,043][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:19:46,366][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:19:46,690][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:19:47,012][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:19:47,335][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:19:47,661][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:19:47,988][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:19:48,313][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:19:48,636][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:19:48,959][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:19:49,283][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:19:49,607][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:19:49,930][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:19:50,658][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:19:51,387][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:19:51,389][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:19:51,390][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:19:52,356][__main__][INFO] - Iteration 212 took 23s (39.15% Gen, 56.77% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 29m 4s. Estimated total time: 19h 45m 7s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 30s, 500 more iterations: 3h 17m 31s.
[2025-11-13 09:19:52,358][__main__][INFO] - Starting iteration 212.
[2025-11-13 09:19:52,361][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 21 and human policies 1.
[2025-11-13 09:19:52,361][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:20:02,063][__main__][INFO] - Number of regex retries in iteration 212: 0
[2025-11-13 09:20:02,064][__main__][INFO] - agents played in iteration 212 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:20:02,532][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:20:02,568][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:20:02,602][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:20:02,636][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:20:02,636][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:20:02,637][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:20:03,354][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:20:03,648][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:20:03,974][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:20:04,300][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:20:04,622][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:20:04,947][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:20:05,270][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:20:05,593][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:20:05,917][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:20:06,245][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:20:06,571][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:20:06,900][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:20:07,226][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:20:07,553][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:20:07,876][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:20:08,200][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:20:08,527][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:20:08,855][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:20:09,178][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:20:09,502][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:20:09,829][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:20:10,154][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:20:10,477][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:20:10,800][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:20:11,124][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:20:11,449][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:20:11,773][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:20:12,098][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:20:12,423][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:20:12,747][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:20:13,073][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:20:13,400][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:20:13,729][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:20:14,458][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:20:15,205][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:20:15,207][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:20:15,208][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:20:16,264][__main__][INFO] - Iteration 213 took 23s (40.59% Gen, 54.99% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 38m 45s. Estimated total time: 19h 55m 12s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 50s, 500 more iterations: 3h 19m 12s.
[2025-11-13 09:20:16,266][__main__][INFO] - Starting iteration 213.
[2025-11-13 09:20:16,269][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 21 and human policies 1.
[2025-11-13 09:20:16,270][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:20:25,833][__main__][INFO] - Number of regex retries in iteration 213: 0
[2025-11-13 09:20:25,833][__main__][INFO] - agents played in iteration 213 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:20:26,294][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:20:26,331][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:20:26,364][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:20:26,397][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:20:26,398][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:20:26,398][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:20:27,099][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:20:27,393][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:20:27,716][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:20:28,038][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:20:28,362][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:20:28,686][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:20:29,008][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:20:29,332][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:20:29,658][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:20:29,981][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:20:30,303][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:20:30,627][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:20:30,950][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:20:31,274][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:20:31,600][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:20:31,924][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:20:32,246][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:20:32,570][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:20:32,895][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:20:33,218][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:20:33,543][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:20:33,868][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:20:34,191][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:20:34,513][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:20:34,838][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:20:35,163][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:20:35,486][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:20:35,809][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:20:36,134][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:20:36,457][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:20:36,781][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:20:37,105][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:20:37,430][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:20:38,147][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:20:38,901][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:20:38,902][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:20:38,904][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:20:39,887][__main__][INFO] - Iteration 214 took 23s (40.49% Gen, 55.34% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 24m 6s. Estimated total time: 19h 40m 57s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 21s, 500 more iterations: 3h 16m 49s. [2025-11-13 09:20:39,889][__main__][INFO] - Starting iteration 214. [2025-11-13 09:20:39,892][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 21 and human policies 1. 
[2025-11-13 09:20:39,892][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:20:48,489][__main__][INFO] - Number of regex retries in iteration 214: 0
[2025-11-13 09:20:48,490][__main__][INFO] - agents played in iteration 214 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:20:48,956][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:20:48,990][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:20:49,023][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:20:49,057][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:20:49,057][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:20:49,058][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:20:49,791][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:20:50,085][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:20:50,412][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:20:50,735][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:20:51,060][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:20:51,382][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:20:51,705][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:20:52,029][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:20:52,351][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:20:52,675][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:20:53,002][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:20:53,327][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:20:53,652][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:20:53,973][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:20:54,296][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:20:54,622][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:20:54,945][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:20:55,269][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:20:55,594][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:20:55,922][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:20:56,248][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:20:56,573][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:20:56,896][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:20:57,221][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:20:57,544][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:20:57,870][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:20:58,194][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:20:58,520][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:20:58,843][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:20:59,165][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:20:59,489][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:20:59,813][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:21:00,136][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:21:00,835][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:21:01,556][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:21:01,558][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:21:01,559][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:21:02,544][__main__][INFO] - Iteration 215 took 22s (37.95% Gen, 57.69% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 35m 25s. Estimated total time: 18h 52m 38s. Time estimates for 10 more iterations: 3m 46s, 100 more iterations: 37m 45s, 500 more iterations: 3h 8m 46s.
[2025-11-13 09:21:02,546][__main__][INFO] - Starting iteration 215.
[2025-11-13 09:21:02,550][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 21 and human policies 1.
[2025-11-13 09:21:02,551][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:21:11,541][__main__][INFO] - Number of regex retries in iteration 215: 0
[2025-11-13 09:21:11,542][__main__][INFO] - agents played in iteration 215 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:21:11,990][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:21:12,023][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:21:12,057][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:21:12,091][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:21:12,091][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:21:12,091][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:21:12,838][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:21:13,134][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:21:13,465][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:21:13,791][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:21:14,114][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:21:14,440][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:21:14,763][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:21:15,088][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:21:15,410][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:21:15,735][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:21:16,060][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:21:16,388][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:21:16,715][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:21:17,039][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:21:17,365][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:21:17,689][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:21:18,013][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:21:18,336][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:21:18,660][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:21:18,984][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:21:19,308][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:21:19,631][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:21:19,955][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:21:20,280][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:21:20,605][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:21:20,933][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:21:21,257][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:21:21,581][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:21:21,908][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:21:22,234][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:21:22,558][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:21:22,881][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:21:23,206][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:21:23,870][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:21:24,623][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:21:24,627][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:21:24,629][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:21:25,614][__main__][INFO] - Iteration 216 took 23s (38.98% Gen, 56.74% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 55m 36s. Estimated total time: 19h 13m 12s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 26s, 500 more iterations: 3h 12m 12s.
[2025-11-13 09:21:25,616][__main__][INFO] - Starting iteration 216.
[2025-11-13 09:21:25,619][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 21 and human policies 1.
[2025-11-13 09:21:25,619][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:21:35,250][__main__][INFO] - Number of regex retries in iteration 216: 0
[2025-11-13 09:21:35,251][__main__][INFO] - agents played in iteration 216 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:21:35,695][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:21:35,727][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:21:35,759][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:21:35,792][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:21:35,792][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:21:35,793][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:21:36,519][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:21:36,815][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:21:37,140][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:21:37,467][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:21:37,792][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:21:38,116][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:21:38,443][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:21:38,770][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:21:39,094][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:21:39,418][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:21:39,742][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:21:40,067][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:21:40,393][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:21:40,717][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:21:41,040][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:21:41,365][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:21:41,689][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:21:42,012][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:21:42,337][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:21:42,661][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:21:42,985][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:21:43,310][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:21:43,635][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:21:43,961][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:21:44,284][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:21:44,607][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:21:44,929][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:21:45,254][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:21:45,577][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:21:45,903][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:21:46,227][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:21:46,551][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:21:46,879][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:21:47,545][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:21:48,274][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:21:48,276][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:21:48,277][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:21:49,261][__main__][INFO] - Iteration 217 took 23s (40.74% Gen, 55.10% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 24m 9s. Estimated total time: 19h 42m 9s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 24s, 500 more iterations: 3h 17m 1s.
[2025-11-13 09:21:49,263][__main__][INFO] - Starting iteration 217.
[2025-11-13 09:21:49,266][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 21 and human policies 1.
[2025-11-13 09:21:49,267][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:21:58,912][__main__][INFO] - Number of regex retries in iteration 217: 0
[2025-11-13 09:21:58,912][__main__][INFO] - agents played in iteration 217 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:21:59,373][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:21:59,406][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:21:59,439][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:21:59,473][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:21:59,473][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:21:59,474][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:22:00,222][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:22:00,519][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:22:00,844][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:22:01,167][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:22:01,491][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:22:01,816][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:22:02,143][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:22:02,467][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:22:02,790][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:22:03,112][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:22:03,435][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:22:03,759][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:22:04,085][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:22:04,409][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:22:04,733][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:22:05,056][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:22:05,380][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:22:05,704][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:22:06,030][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:22:06,353][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:22:06,677][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:22:07,002][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:22:07,326][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:22:07,649][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:22:07,972][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:22:08,296][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:22:08,621][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:22:08,944][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:22:09,267][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:22:09,591][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:22:09,916][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:22:10,239][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:22:10,562][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:22:11,251][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:22:11,996][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:22:11,997][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:22:11,999][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:22:12,993][__main__][INFO] - Iteration 218 took 23s (40.65% Gen, 55.15% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 27m 59s. Estimated total time: 19h 46m 23s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 32s, 500 more iterations: 3h 17m 43s.
[2025-11-13 09:22:12,995][__main__][INFO] - Starting iteration 218.
[2025-11-13 09:22:12,999][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 21 and human policies 1.
[2025-11-13 09:22:13,000][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:22:21,784][__main__][INFO] - Number of regex retries in iteration 218: 0
[2025-11-13 09:22:21,784][__main__][INFO] - agents played in iteration 218 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:22:22,234][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:22:22,270][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:22:22,302][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:22:22,335][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:22:22,335][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:22:22,336][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:22:23,025][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:22:23,322][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:22:23,648][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:22:23,972][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:22:24,297][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:22:24,623][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:22:24,947][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:22:25,271][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:22:25,595][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:22:25,918][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:22:26,240][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:22:26,564][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:22:26,888][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:22:27,211][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:22:27,535][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:22:27,858][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:22:28,180][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:22:28,503][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:22:28,826][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:22:29,149][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:22:29,473][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:22:29,799][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:22:30,124][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:22:30,448][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:22:30,772][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:22:31,095][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:22:31,418][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:22:31,743][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:22:32,069][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:22:32,392][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:22:32,715][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:22:33,038][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:22:33,361][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:22:34,025][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:22:34,759][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:22:34,762][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:22:34,763][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:22:35,733][__main__][INFO] - Iteration 219 took 22s (38.64% Gen, 57.09% Train). Generation: 8s, Training: 12s. Estimated remaining time: 17h 37m 56s. Estimated total time: 18h 56m 43s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 53s, 500 more iterations: 3h 9m 27s.
[2025-11-13 09:22:35,735][__main__][INFO] - Starting iteration 219.
[2025-11-13 09:22:35,738][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 21 and human policies 1.
[2025-11-13 09:22:35,738][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:22:44,684][__main__][INFO] - Number of regex retries in iteration 219: 0
[2025-11-13 09:22:44,685][__main__][INFO] - agents played in iteration 219 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:22:45,140][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:22:45,172][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:22:45,205][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:22:45,237][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:22:45,238][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:22:45,238][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:22:45,959][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:22:46,255][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:22:46,584][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:22:46,910][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:22:47,235][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:22:47,561][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:22:47,886][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:22:48,210][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:22:48,534][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:22:48,859][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:22:49,184][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:22:49,507][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:22:49,831][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:22:50,154][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:22:50,480][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:22:50,804][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:22:51,127][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:22:51,450][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:22:51,775][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:22:52,098][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:22:52,421][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:22:52,747][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:22:53,071][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:22:53,394][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:22:53,718][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:22:54,042][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:22:54,366][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:22:54,689][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:22:55,012][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:22:55,335][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:22:55,659][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:22:55,982][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:22:56,308][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:22:56,976][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:22:57,707][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:22:57,709][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:22:57,711][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:22:58,734][__main__][INFO] - Iteration 220 took 22s (38.90% Gen, 56.64% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 50m 42s. Estimated total time: 19h 9m 52s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 19s, 500 more iterations: 3h 11m 38s.
[2025-11-13 09:22:58,736][__main__][INFO] - Starting iteration 220.
[2025-11-13 09:22:58,740][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 21 and human policies 1.
[2025-11-13 09:22:58,740][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:23:07,956][__main__][INFO] - Number of regex retries in iteration 220: 0
[2025-11-13 09:23:07,957][__main__][INFO] - agents played in iteration 220 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:23:08,416][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:23:08,450][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:23:08,483][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:23:08,516][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:23:08,517][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:23:08,517][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:23:09,229][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:23:09,524][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:23:09,850][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:23:10,174][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:23:10,498][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:23:10,824][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:23:11,147][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:23:11,470][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:23:11,793][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:23:12,117][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:23:12,442][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:23:12,764][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:23:13,088][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:23:13,411][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:23:13,736][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:23:14,061][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:23:14,385][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:23:14,708][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:23:15,031][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:23:15,355][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:23:15,679][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:23:16,003][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:23:16,327][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:23:16,650][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:23:16,975][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:23:17,298][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:23:17,622][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:23:17,946][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:23:18,269][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:23:18,592][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:23:18,914][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:23:19,237][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:23:19,563][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:23:20,254][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:23:20,994][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:23:20,995][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:23:20,997][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:23:22,893][__main__][INFO] - Iteration 221 took 24s (38.16% Gen, 53.99% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 48m 9s. Estimated total time: 20h 7m 43s. Time estimates for 10 more iterations: 4m 1s, 100 more iterations: 40m 15s, 500 more iterations: 3h 21m 17s.
[2025-11-13 09:23:22,895][__main__][INFO] - Starting iteration 221.
[2025-11-13 09:23:22,899][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 22 and human policies 1.
[2025-11-13 09:23:22,900][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:23:32,213][__main__][INFO] - Number of regex retries in iteration 221: 0
[2025-11-13 09:23:32,213][__main__][INFO] - agents played in iteration 221 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:23:32,701][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:23:32,733][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:23:32,766][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:23:32,799][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:23:32,799][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:23:32,800][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:23:33,509][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:23:33,805][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:23:34,129][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:23:34,454][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:23:34,782][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:23:35,112][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:23:35,436][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:23:35,760][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:23:36,085][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:23:36,409][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:23:36,734][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:23:37,057][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:23:37,381][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:23:37,705][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:23:38,030][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:23:38,352][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:23:38,675][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:23:38,999][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:23:39,323][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:23:39,646][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:23:39,969][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:23:40,292][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:23:40,618][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:23:40,942][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:23:41,265][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:23:41,589][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:23:41,914][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:23:42,237][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:23:42,562][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:23:42,884][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:23:43,209][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:23:43,532][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:23:43,855][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:23:44,522][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:23:45,255][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:23:45,257][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:23:45,258][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:23:46,230][__main__][INFO] - Iteration 222 took 23s (39.92% Gen, 55.91% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 6m 37s. Estimated total time: 19h 26m 34s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 53s, 500 more iterations: 3h 14m 25s.
[2025-11-13 09:23:46,232][__main__][INFO] - Starting iteration 222.
[2025-11-13 09:23:46,235][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 22 and human policies 1.
[2025-11-13 09:23:46,236][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:23:55,806][__main__][INFO] - Number of regex retries in iteration 222: 0
[2025-11-13 09:23:55,807][__main__][INFO] - agents played in iteration 222 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:23:56,262][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:23:56,300][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:23:56,332][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:23:56,365][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:23:56,366][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:23:56,366][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:23:57,087][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:23:57,381][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:23:57,706][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:23:58,031][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:23:58,355][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:23:58,682][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:23:59,006][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:23:59,330][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:23:59,654][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:23:59,975][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:24:00,298][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:24:00,623][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:24:00,947][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:24:01,269][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:24:01,592][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:24:01,918][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:24:02,241][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:24:02,564][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:24:02,888][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:24:03,211][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:24:03,535][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:24:03,858][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:24:04,181][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:24:04,506][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:24:04,831][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:24:05,155][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:24:05,483][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:24:05,808][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:24:06,133][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:24:06,458][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:24:06,786][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:24:07,109][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:24:07,433][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:24:08,097][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:24:08,825][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:24:08,827][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:24:08,828][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:24:09,853][__main__][INFO] - Iteration 223 took 23s (40.52% Gen, 55.13% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 20m 33s. Estimated total time: 19h 40m 54s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 21s, 500 more iterations: 3h 16m 49s.
[2025-11-13 09:24:09,855][__main__][INFO] - Starting iteration 223.
[2025-11-13 09:24:09,858][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 22 and human policies 1.
[2025-11-13 09:24:09,859][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:24:19,253][__main__][INFO] - Number of regex retries in iteration 223: 0
[2025-11-13 09:24:19,254][__main__][INFO] - agents played in iteration 223 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:24:19,707][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:24:19,742][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:24:19,775][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:24:19,808][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:24:19,809][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:24:19,809][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:24:20,499][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:24:20,797][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:24:21,123][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:24:21,448][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:24:21,772][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:24:22,095][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:24:22,420][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:24:22,744][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:24:23,069][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:24:23,392][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:24:23,720][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:24:24,045][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:24:24,369][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:24:24,696][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:24:25,017][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:24:25,341][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:24:25,665][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:24:25,988][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:24:26,313][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:24:26,636][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:24:26,962][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:24:27,285][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:24:27,609][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:24:27,932][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:24:28,257][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:24:28,578][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:24:28,903][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:24:29,226][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:24:29,548][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:24:29,872][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:24:30,196][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:24:30,521][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:24:30,844][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:24:31,504][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:24:32,246][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:24:32,248][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:24:32,250][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:24:33,479][__main__][INFO] - Iteration 224 took 23s (39.77% Gen, 55.02% Train). Generation: 9s, Training: 12s. Estimated remaining time: 18h 20m 20s. Estimated total time: 19h 41m 5s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 22s, 500 more iterations: 3h 16m 50s.
[2025-11-13 09:24:33,481][__main__][INFO] - Starting iteration 224.
[2025-11-13 09:24:33,485][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 22 and human policies 1.
[2025-11-13 09:24:33,486][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:24:42,518][__main__][INFO] - Number of regex retries in iteration 224: 0
[2025-11-13 09:24:42,519][__main__][INFO] - agents played in iteration 224 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:24:42,978][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:24:43,014][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:24:43,046][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:24:43,079][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:24:43,080][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:24:43,080][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:24:43,783][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:24:44,078][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:24:44,404][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:24:44,732][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:24:45,059][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:24:45,386][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:24:45,712][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:24:46,041][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:24:46,367][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:24:46,695][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:24:47,018][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:24:47,342][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:24:47,665][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:24:47,988][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:24:48,313][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:24:48,638][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:24:48,966][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:24:49,292][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:24:49,616][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:24:49,939][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:24:50,263][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:24:50,587][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:24:50,911][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:24:51,235][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:24:51,559][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:24:51,886][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:24:52,210][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:24:52,533][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:24:52,860][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:24:53,184][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:24:53,510][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:24:53,834][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:24:54,160][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:24:54,830][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:24:55,563][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:24:55,565][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:24:55,566][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:24:56,599][__main__][INFO] - Iteration 225 took 23s (39.08% Gen, 56.45% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 54m 36s. Estimated total time: 19h 15m 43s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 31s, 500 more iterations: 3h 12m 37s.
[2025-11-13 09:24:56,600][__main__][INFO] - Starting iteration 225.
[2025-11-13 09:24:56,603][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 22 and human policies 1.
[2025-11-13 09:24:56,604][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:25:05,944][__main__][INFO] - Number of regex retries in iteration 225: 0 [2025-11-13 09:25:05,945][__main__][INFO] - agents played in iteration 225 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 09:25:06,441][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:25:06,476][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:25:06,510][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:25:06,543][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:25:06,544][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:25:06,544][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:25:07,245][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:25:07,541][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:25:07,866][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:25:08,192][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:25:08,520][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:25:08,845][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:25:09,174][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:25:09,499][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:25:09,822][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:25:10,145][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:25:10,468][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:25:10,793][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:25:11,119][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:25:11,445][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:25:11,768][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:25:12,092][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:25:12,417][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:25:12,742][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:25:13,065][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:25:13,390][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:25:13,716][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:25:14,041][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:25:14,365][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:25:14,689][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:25:15,015][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:25:15,340][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:25:15,663][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:25:15,987][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:25:16,314][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:25:16,636][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:25:16,958][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:25:17,283][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:25:17,607][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:25:18,274][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:25:19,007][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:25:19,008][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:25:19,010][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:25:20,149][__main__][INFO] - Iteration 226 took 23s (39.67% Gen, 55.49% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 15m 50s. Estimated total time: 19h 37m 21s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 14s, 500 more iterations: 3h 16m 13s.
[2025-11-13 09:25:20,152][__main__][INFO] - Starting iteration 226.
[2025-11-13 09:25:20,155][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 22 and human policies 1.
[2025-11-13 09:25:20,156][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:25:29,011][__main__][INFO] - Number of regex retries in iteration 226: 0
[2025-11-13 09:25:29,012][__main__][INFO] - agents played in iteration 226 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:25:29,476][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:25:29,835][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:25:29,868][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:25:29,901][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:25:29,901][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:25:29,902][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:25:30,601][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:25:30,896][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:25:31,223][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:25:31,551][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:25:31,881][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:25:32,211][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:25:32,538][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:25:32,865][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:25:33,189][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:25:33,514][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:25:33,838][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:25:34,168][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:25:34,492][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:25:34,815][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:25:35,138][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:25:35,464][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:25:35,789][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:25:36,113][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:25:36,436][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:25:36,764][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:25:37,087][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:25:37,410][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:25:37,735][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:25:38,061][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:25:38,387][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:25:38,714][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:25:39,037][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:25:39,363][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:25:39,689][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:25:40,014][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:25:40,338][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:25:40,663][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:25:40,986][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:25:41,664][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:25:42,409][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:25:42,410][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:25:42,412][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:25:43,396][__main__][INFO] - Iteration 227 took 23s (38.10% Gen, 57.66% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 0m 12s. Estimated total time: 19h 22m 6s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 44s, 500 more iterations: 3h 13m 41s.
[2025-11-13 09:25:43,399][__main__][INFO] - Starting iteration 227.
[2025-11-13 09:25:43,402][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 22 and human policies 1.
[2025-11-13 09:25:43,403][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:25:52,110][__main__][INFO] - Number of regex retries in iteration 227: 0
[2025-11-13 09:25:52,110][__main__][INFO] - agents played in iteration 227 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:25:52,588][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:25:52,622][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:25:52,655][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:25:52,688][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:25:52,689][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:25:52,689][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:25:53,390][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:25:53,686][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:25:54,011][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:25:54,337][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:25:54,665][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:25:54,990][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:25:55,315][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:25:55,640][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:25:55,964][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:25:56,290][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:25:56,616][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:25:56,941][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:25:57,264][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:25:57,589][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:25:57,911][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:25:58,234][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:25:58,557][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:25:58,879][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:25:59,206][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:25:59,532][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:25:59,858][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:26:00,182][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:26:00,508][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:26:00,830][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:26:01,153][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:26:01,478][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:26:01,804][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:26:02,128][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:26:02,452][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:26:02,774][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:26:03,098][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:26:03,422][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:26:03,750][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:26:04,410][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:26:05,145][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:26:05,146][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:26:05,148][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:26:06,180][__main__][INFO] - Iteration 228 took 22s (38.22% Gen, 57.24% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 36m 38s. Estimated total time: 18h 58m 55s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 57s, 500 more iterations: 3h 9m 49s.
[2025-11-13 09:26:06,182][__main__][INFO] - Starting iteration 228.
[2025-11-13 09:26:06,185][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 22 and human policies 1.
[2025-11-13 09:26:06,185][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:26:14,848][__main__][INFO] - Number of regex retries in iteration 228: 0
[2025-11-13 09:26:14,849][__main__][INFO] - agents played in iteration 228 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:26:15,310][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:26:15,343][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:26:15,376][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:26:15,409][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:26:15,409][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:26:15,410][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:26:16,114][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:26:16,407][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:26:16,731][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:26:17,056][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:26:17,381][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:26:17,705][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:26:18,032][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:26:18,363][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:26:18,690][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:26:19,015][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:26:19,343][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:26:19,668][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:26:19,992][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:26:20,318][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:26:20,640][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:26:20,963][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:26:21,287][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:26:21,610][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:26:21,937][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:26:22,262][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:26:22,587][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:26:22,911][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:26:23,236][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:26:23,558][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:26:23,882][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:26:24,205][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:26:24,528][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:26:24,851][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:26:25,178][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:26:25,503][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:26:25,828][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:26:26,154][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:26:26,483][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:26:27,151][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:26:27,887][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:26:27,889][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:26:27,890][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:26:28,922][__main__][INFO] - Iteration 229 took 22s (38.10% Gen, 57.36% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 34m 15s. Estimated total time: 18h 56m 55s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 53s, 500 more iterations: 3h 9m 29s.
[2025-11-13 09:26:28,924][__main__][INFO] - Starting iteration 229.
[2025-11-13 09:26:28,927][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 22 and human policies 1.
[2025-11-13 09:26:28,928][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:26:36,761][__main__][INFO] - Number of regex retries in iteration 229: 0
[2025-11-13 09:26:36,762][__main__][INFO] - agents played in iteration 229 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:26:37,224][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:26:37,258][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:26:37,292][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:26:37,325][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:26:37,326][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:26:37,326][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:26:38,048][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:26:38,342][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:26:38,667][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:26:38,990][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:26:39,316][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:26:39,640][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:26:39,965][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:26:40,289][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:26:40,618][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:26:40,945][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:26:41,274][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:26:41,604][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:26:41,933][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:26:42,260][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:26:42,590][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:26:42,918][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:26:43,247][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:26:43,573][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:26:43,898][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:26:44,224][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:26:44,551][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:26:44,878][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:26:45,205][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:26:45,531][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:26:45,856][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:26:46,185][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:26:46,517][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:26:46,843][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:26:47,169][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:26:47,493][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:26:47,818][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:26:48,143][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:26:48,472][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:26:49,155][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:26:49,903][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:26:49,904][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:26:49,906][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:26:50,935][__main__][INFO] - Iteration 230 took 22s (35.60% Gen, 59.72% Train). Generation: 7s, Training: 13s. Estimated remaining time: 16h 57m 24s. Estimated total time: 18h 20m 26s. Time estimates for 10 more iterations: 3m 40s, 100 more iterations: 36m 40s, 500 more iterations: 3h 3m 24s.
[2025-11-13 09:26:50,937][__main__][INFO] - Starting iteration 230.
[2025-11-13 09:26:50,941][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 22 and human policies 1.
[2025-11-13 09:26:50,941][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:26:58,628][__main__][INFO] - Number of regex retries in iteration 230: 0
[2025-11-13 09:26:58,629][__main__][INFO] - agents played in iteration 230 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:26:59,107][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:26:59,141][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:26:59,174][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:26:59,208][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:26:59,208][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:26:59,209][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:26:59,927][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:27:00,222][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:27:00,547][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:27:00,871][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:27:01,196][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:27:01,521][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:27:01,846][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:27:02,176][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:27:02,501][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:27:02,828][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:27:03,154][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:27:03,480][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:27:03,806][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:27:04,130][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:27:04,455][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:27:04,781][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:27:05,109][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:27:05,433][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:27:05,763][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:27:06,090][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:27:06,416][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:27:06,742][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:27:07,070][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:27:07,396][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:27:07,721][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:27:08,044][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:27:08,368][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:27:08,694][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:27:09,019][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:27:09,343][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:27:09,666][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:27:09,988][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:27:10,312][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:27:10,981][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:27:11,715][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:27:11,717][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:27:11,719][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:27:13,644][__main__][INFO] - Iteration 231 took 22s (33.86% Gen, 57.65% Train). Generation: 7s, Training: 13s. Estimated remaining time: 17h 31m 48s. Estimated total time: 18h 55m 13s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 50s, 500 more iterations: 3h 9m 12s.
[2025-11-13 09:27:13,646][__main__][INFO] - Starting iteration 231.
[2025-11-13 09:27:13,650][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 23 and human policies 1.
[2025-11-13 09:27:13,651][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:27:22,120][__main__][INFO] - Number of regex retries in iteration 231: 0
[2025-11-13 09:27:22,121][__main__][INFO] - agents played in iteration 231 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:27:22,599][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:27:22,632][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:27:22,666][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:27:22,699][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:27:22,700][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:27:22,700][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:27:23,419][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:27:23,713][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:27:24,037][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:27:24,361][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:27:24,685][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:27:25,010][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:27:25,333][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:27:25,658][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:27:25,981][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:27:26,305][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:27:26,629][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:27:26,951][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:27:27,275][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:27:27,597][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:27:27,922][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:27:28,250][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:27:28,575][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:27:28,902][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:27:29,232][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:27:29,559][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:27:29,885][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:27:30,211][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:27:30,540][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:27:30,865][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:27:31,192][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:27:31,520][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:27:31,842][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:27:32,165][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:27:32,489][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:27:32,814][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:27:33,142][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:27:33,466][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:27:33,791][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:27:34,459][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:27:35,193][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:27:35,194][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:27:35,196][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:27:36,311][__main__][INFO] - Iteration 232 took 22s (37.37% Gen, 57.69% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 29m 20s. Estimated total time: 18h 53m 7s. Time estimates for 10 more iterations: 3m 46s, 100 more iterations: 37m 46s, 500 more iterations: 3h 8m 51s.
[2025-11-13 09:27:36,314][__main__][INFO] - Starting iteration 232.
[2025-11-13 09:27:36,317][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 23 and human policies 1.
[2025-11-13 09:27:36,317][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:27:44,675][__main__][INFO] - Number of regex retries in iteration 232: 0
[2025-11-13 09:27:44,675][__main__][INFO] - agents played in iteration 232 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:27:45,161][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:27:45,195][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:27:45,228][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:27:45,262][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:27:45,262][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:27:45,263][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:27:45,984][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:27:46,279][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:27:46,603][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:27:46,927][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:27:47,251][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:27:47,574][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:27:47,898][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:27:48,222][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:27:48,546][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:27:48,871][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:27:49,196][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:27:49,521][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:27:49,846][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:27:50,175][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:27:50,498][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:27:50,827][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:27:51,153][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:27:51,478][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:27:51,805][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:27:52,131][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:27:52,455][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:27:52,779][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:27:53,104][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:27:53,429][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:27:53,752][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:27:54,075][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:27:54,400][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:27:54,726][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:27:55,051][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:27:55,373][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:27:55,696][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:27:56,019][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:27:56,346][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:27:57,014][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:27:57,741][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:27:57,742][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:27:57,744][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:27:58,709][__main__][INFO] - Iteration 233 took 22s (37.32% Gen, 58.36% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 15m 29s. Estimated total time: 18h 39m 39s. Time estimates for 10 more iterations: 3m 43s, 100 more iterations: 37m 19s, 500 more iterations: 3h 6m 36s.
[2025-11-13 09:27:58,711][__main__][INFO] - Starting iteration 233.
[2025-11-13 09:27:58,714][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 23 and human policies 1.
[2025-11-13 09:27:58,715][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:28:07,061][__main__][INFO] - Number of regex retries in iteration 233: 0
[2025-11-13 09:28:07,062][__main__][INFO] - agents played in iteration 233 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:28:07,538][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:28:07,571][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:28:07,604][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:28:07,638][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:28:07,639][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:28:07,639][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:28:08,359][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:28:08,654][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:28:08,979][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:28:09,304][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:28:09,629][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:28:09,952][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:28:10,278][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:28:10,601][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:28:10,925][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:28:11,249][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:28:11,572][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:28:11,895][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:28:12,220][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:28:12,544][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:28:12,869][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:28:13,193][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:28:13,523][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:28:13,847][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:28:14,170][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:28:14,494][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:28:14,821][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:28:15,147][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:28:15,473][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:28:15,799][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:28:16,124][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:28:16,450][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:28:16,775][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:28:17,100][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:28:17,425][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:28:17,751][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:28:18,077][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:28:18,401][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:28:18,727][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:28:19,400][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:28:20,148][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:28:20,150][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:28:20,152][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:28:21,133][__main__][INFO] - Iteration 234 took 22s (37.23% Gen, 58.39% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 16m 26s. Estimated total time: 18h 40m 58s. Time estimates for 10 more iterations: 3m 44s, 100 more iterations: 37m 21s, 500 more iterations: 3h 6m 49s.
[2025-11-13 09:28:21,135][__main__][INFO] - Starting iteration 234.
[2025-11-13 09:28:21,138][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 23 and human policies 1.
[2025-11-13 09:28:21,138][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:28:29,418][__main__][INFO] - Number of regex retries in iteration 234: 0
[2025-11-13 09:28:29,419][__main__][INFO] - agents played in iteration 234 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:28:29,882][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:28:29,917][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:28:29,951][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:28:29,984][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:28:29,985][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:28:29,985][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:28:30,737][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:28:31,032][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:28:31,356][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:28:31,680][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:28:32,005][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:28:32,329][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:28:32,652][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:28:32,975][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:28:33,298][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:28:33,622][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:28:33,944][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:28:34,267][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:28:34,591][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:28:34,915][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:28:35,239][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:28:35,564][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:28:35,888][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:28:36,213][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:28:36,540][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:28:36,866][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:28:37,191][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:28:37,516][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:28:37,841][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:28:38,166][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:28:38,490][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:28:38,815][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:28:39,140][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:28:39,470][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:28:39,795][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:28:40,120][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:28:40,446][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:28:40,770][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:28:41,096][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:28:41,759][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:28:42,487][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:28:42,488][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:28:42,490][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:28:43,451][__main__][INFO] - Iteration 235 took 22s (37.11% Gen, 58.58% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 10m 48s. Estimated total time: 18h 35m 42s. Time estimates for 10 more iterations: 3m 43s, 100 more iterations: 37m 11s, 500 more iterations: 3h 5m 57s.
[2025-11-13 09:28:43,453][__main__][INFO] - Starting iteration 235.
[2025-11-13 09:28:43,457][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 23 and human policies 1.
[2025-11-13 09:28:43,457][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:28:52,022][__main__][INFO] - Number of regex retries in iteration 235: 0
[2025-11-13 09:28:52,023][__main__][INFO] - agents played in iteration 235 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:28:52,487][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:28:52,521][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:28:52,555][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:28:52,589][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:28:52,589][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:28:52,590][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:28:53,334][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:28:53,629][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:28:53,954][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:28:54,278][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:28:54,601][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:28:54,924][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:28:55,248][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:28:55,571][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:28:55,895][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:28:56,218][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:28:56,541][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:28:56,866][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:28:57,190][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:28:57,514][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:28:57,840][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:28:58,163][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:28:58,487][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:28:58,812][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:28:59,135][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:28:59,460][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:28:59,790][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:29:00,123][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:29:00,449][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:29:00,772][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:29:01,096][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:29:01,424][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:29:01,752][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:29:02,077][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:29:02,403][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:29:02,727][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:29:03,057][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:29:03,383][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:29:03,710][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:29:04,415][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:29:05,154][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:29:05,156][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:29:05,157][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:29:06,214][__main__][INFO] - Iteration 236 took 22s (37.64% Gen, 57.72% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 32m 37s. Estimated total time: 18h 57m 54s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 55s, 500 more iterations: 3h 9m 39s.
[2025-11-13 09:29:06,216][__main__][INFO] - Starting iteration 236.
[2025-11-13 09:29:06,219][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 23 and human policies 1.
[2025-11-13 09:29:06,219][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:29:14,308][__main__][INFO] - Number of regex retries in iteration 236: 0 [2025-11-13 09:29:14,308][__main__][INFO] - agents played in iteration 236 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 09:29:14,770][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:29:14,805][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:29:14,839][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:29:14,873][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:29:14,873][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:29:14,874][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:29:15,604][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:29:15,900][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:29:16,225][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:29:16,549][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:29:16,875][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:29:17,200][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:29:17,526][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:29:17,849][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:29:18,173][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:29:18,496][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:29:18,820][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:29:19,143][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:29:19,467][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:29:19,791][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:29:20,114][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:29:20,437][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:29:20,762][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:29:21,087][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:29:21,412][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:29:21,737][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:29:22,061][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:29:22,386][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:29:22,710][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:29:23,035][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:29:23,360][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:29:23,683][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:29:24,007][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:29:24,334][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:29:24,658][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:29:24,983][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:29:25,306][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:29:25,632][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:29:25,957][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:29:26,679][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:29:27,420][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:29:27,422][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:29:27,424][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:29:28,406][__main__][INFO] - Iteration 237 took 22s (36.46% Gen, 59.11% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 3m 43s. Estimated total time: 18h 29m 22s. Time estimates for 10 more iterations: 3m 41s, 100 more iterations: 36m 58s, 500 more iterations: 3h 4m 53s.
[2025-11-13 09:29:28,408][__main__][INFO] - Starting iteration 237.
[2025-11-13 09:29:28,411][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 23 and human policies 1.
[2025-11-13 09:29:28,411][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:29:36,664][__main__][INFO] - Number of regex retries in iteration 237: 0
[2025-11-13 09:29:36,665][__main__][INFO] - agents played in iteration 237 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:29:37,132][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:29:37,166][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:29:37,198][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:29:37,231][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:29:37,232][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:29:37,232][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:29:37,953][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:29:38,249][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:29:38,575][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:29:38,899][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:29:39,226][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:29:39,552][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:29:39,875][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:29:40,201][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:29:40,525][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:29:40,849][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:29:41,173][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:29:41,496][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:29:41,821][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:29:42,145][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:29:42,471][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:29:42,795][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:29:43,120][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:29:43,443][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:29:43,767][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:29:44,091][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:29:44,414][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:29:44,737][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:29:45,061][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:29:45,386][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:29:45,708][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:29:46,033][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:29:46,356][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:29:46,682][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:29:47,008][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:29:47,332][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:29:47,658][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:29:47,986][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:29:48,313][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:29:49,024][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:29:49,763][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:29:49,765][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:29:49,766][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:29:50,759][__main__][INFO] - Iteration 238 took 22s (36.93% Gen, 58.62% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 11m 25s. Estimated total time: 18h 37m 26s. Time estimates for 10 more iterations: 3m 43s, 100 more iterations: 37m 14s, 500 more iterations: 3h 6m 14s.
[2025-11-13 09:29:50,761][__main__][INFO] - Starting iteration 238.
[2025-11-13 09:29:50,764][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 23 and human policies 1.
[2025-11-13 09:29:50,765][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:29:58,871][__main__][INFO] - Number of regex retries in iteration 238: 0
[2025-11-13 09:29:58,872][__main__][INFO] - agents played in iteration 238 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:29:59,491][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:29:59,537][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:29:59,576][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:29:59,613][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:29:59,613][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:29:59,613][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:30:00,338][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:30:00,635][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:30:00,960][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:30:01,285][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:30:01,609][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:30:01,940][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:30:02,264][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:30:02,589][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:30:02,914][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:30:03,239][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:30:03,565][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:30:03,889][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:30:04,215][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:30:04,538][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:30:04,863][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:30:05,189][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:30:05,514][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:30:05,839][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:30:06,163][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:30:06,489][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:30:06,812][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:30:07,135][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:30:07,458][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:30:07,782][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:30:08,108][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:30:08,432][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:30:08,756][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:30:09,081][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:30:09,406][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:30:09,731][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:30:10,057][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:30:10,381][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:30:10,706][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:30:11,421][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:30:12,163][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:30:12,165][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:30:12,166][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:30:13,131][__main__][INFO] - Iteration 239 took 22s (36.24% Gen, 59.43% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 12m 1s. Estimated total time: 18h 38m 25s. Time estimates for 10 more iterations: 3m 43s, 100 more iterations: 37m 16s, 500 more iterations: 3h 6m 24s.
[2025-11-13 09:30:13,134][__main__][INFO] - Starting iteration 239.
[2025-11-13 09:30:13,137][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 23 and human policies 1.
[2025-11-13 09:30:13,138][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:30:21,633][__main__][INFO] - Number of regex retries in iteration 239: 0
[2025-11-13 09:30:21,633][__main__][INFO] - agents played in iteration 239 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:30:22,080][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:30:22,115][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:30:22,147][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:30:22,180][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:30:22,180][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:30:22,181][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:30:22,856][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:30:23,151][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:30:23,476][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:30:23,801][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:30:24,126][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:30:24,453][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:30:24,777][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:30:25,103][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:30:25,429][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:30:25,754][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:30:26,081][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:30:26,405][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:30:26,730][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:30:27,055][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:30:27,379][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:30:27,705][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:30:28,031][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:30:28,355][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:30:28,680][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:30:29,006][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:30:29,330][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:30:29,655][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:30:29,979][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:30:30,304][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:30:30,628][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:30:30,952][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:30:31,276][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:30:31,602][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:30:31,925][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:30:32,249][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:30:32,576][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:30:32,900][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:30:33,226][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:30:33,936][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:30:34,669][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:30:34,670][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:30:34,672][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:30:35,634][__main__][INFO] - Iteration 240 took 22s (37.76% Gen, 57.96% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 18m 6s. Estimated total time: 18h 44m 53s. Time estimates for 10 more iterations: 3m 44s, 100 more iterations: 37m 29s, 500 more iterations: 3h 7m 28s.
[2025-11-13 09:30:35,636][__main__][INFO] - Starting iteration 240.
[2025-11-13 09:30:35,639][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 23 and human policies 1.
[2025-11-13 09:30:35,640][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:30:43,739][__main__][INFO] - Number of regex retries in iteration 240: 0
[2025-11-13 09:30:43,740][__main__][INFO] - agents played in iteration 240 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:30:44,182][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:30:44,217][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:30:44,250][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:30:44,282][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:30:44,283][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:30:44,283][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:30:44,967][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:30:45,260][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:30:45,588][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:30:45,915][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:30:46,237][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:30:46,562][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:30:46,888][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:30:47,212][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:30:47,542][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:30:47,870][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:30:48,199][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:30:48,524][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:30:48,853][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:30:49,177][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:30:49,502][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:30:49,826][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:30:50,151][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:30:50,475][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:30:50,800][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:30:51,125][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:30:51,448][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:30:51,772][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:30:52,096][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:30:52,421][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:30:52,745][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:30:53,070][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:30:53,393][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:30:53,717][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:30:54,042][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:30:54,366][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:30:54,690][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:30:55,014][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:30:55,337][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:30:56,038][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:30:56,782][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:30:56,783][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:30:56,785][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:30:58,887][__main__][INFO] - Iteration 241 took 23s (34.84% Gen, 56.11% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 55m 16s. Estimated total time: 19h 22m 25s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 44s, 500 more iterations: 3h 13m 44s.
[2025-11-13 09:30:58,889][__main__][INFO] - Starting iteration 241.
[2025-11-13 09:30:58,892][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 24 and human policies 1.
[2025-11-13 09:30:58,893][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:31:07,550][__main__][INFO] - Number of regex retries in iteration 241: 0
[2025-11-13 09:31:07,550][__main__][INFO] - agents played in iteration 241 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:31:07,994][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:31:08,030][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:31:08,063][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:31:08,096][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:31:08,096][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:31:08,097][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:31:08,769][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:31:09,062][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:31:09,388][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:31:09,713][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:31:10,038][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:31:10,364][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:31:10,693][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:31:11,021][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:31:11,347][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:31:11,671][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:31:11,999][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:31:12,327][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:31:12,653][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:31:12,981][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:31:13,308][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:31:13,633][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:31:13,957][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:31:14,282][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:31:14,606][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:31:14,931][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:31:15,258][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:31:15,583][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:31:15,907][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:31:16,230][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:31:16,554][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:31:16,878][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:31:17,203][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:31:17,527][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:31:17,851][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:31:18,175][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:31:18,499][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:31:18,822][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:31:19,146][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:31:19,860][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:31:20,596][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:31:20,597][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:31:20,599][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:31:21,626][__main__][INFO] - Iteration 242 took 22s (38.08% Gen, 57.40% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 29m 11s. Estimated total time: 18h 56m 44s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 53s, 500 more iterations: 3h 9m 27s. [2025-11-13 09:31:21,629][__main__][INFO] - Starting iteration 242. [2025-11-13 09:31:21,632][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 24 and human policies 1. 
[2025-11-13 09:31:21,633][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:31:29,869][__main__][INFO] - Number of regex retries in iteration 242: 0 [2025-11-13 09:31:29,870][__main__][INFO] - agents played in iteration 242 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 09:31:30,315][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:31:30,347][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:31:30,379][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:31:30,412][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:31:30,412][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:31:30,413][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:31:31,084][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:31:31,377][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:31:31,703][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:31:32,025][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:31:32,348][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:31:32,670][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:31:32,998][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:31:33,323][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:31:33,650][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:31:33,974][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:31:34,299][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:31:34,624][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:31:34,949][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:31:35,272][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:31:35,597][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:31:35,924][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:31:36,249][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:31:36,573][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:31:36,896][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:31:37,219][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:31:37,543][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:31:37,868][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:31:38,194][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:31:38,519][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:31:38,844][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:31:39,169][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:31:39,493][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:31:39,817][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:31:40,141][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:31:40,467][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:31:40,792][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:31:41,116][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:31:41,441][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:31:42,161][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:31:42,886][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:31:42,888][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:31:42,890][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:31:43,862][__main__][INFO] - Iteration 243 took 22s (37.06% Gen, 58.57% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 3m 37s. Estimated total time: 18h 31m 32s. Time estimates for 10 more iterations: 3m 42s, 100 more iterations: 37m 3s, 500 more iterations: 3h 5m 15s. [2025-11-13 09:31:43,867][__main__][INFO] - Starting iteration 243. [2025-11-13 09:31:43,870][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 24 and human policies 1. 
[2025-11-13 09:31:43,870][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:31:52,534][__main__][INFO] - Number of regex retries in iteration 243: 0 [2025-11-13 09:31:52,535][__main__][INFO] - agents played in iteration 243 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 09:31:52,979][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:31:53,011][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:31:53,043][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:31:53,076][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:31:53,076][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:31:53,077][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:31:53,744][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:31:54,036][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:31:54,360][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:31:54,684][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:31:55,010][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:31:55,335][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:31:55,660][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:31:55,986][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:31:56,310][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:31:56,636][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:31:56,965][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:31:57,289][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:31:57,614][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:31:57,938][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:31:58,263][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:31:58,589][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:31:58,920][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:31:59,249][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:31:59,573][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:31:59,898][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:32:00,223][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:32:00,548][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:32:00,872][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:32:01,198][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:32:01,523][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:32:01,847][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:32:02,172][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:32:02,495][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:32:02,820][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:32:03,144][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:32:03,468][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:32:03,792][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:32:04,118][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:32:04,825][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:32:05,537][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:32:05,541][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:32:05,542][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:32:06,769][__main__][INFO] - Iteration 244 took 22s (37.83% Gen, 56.80% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 36m 41s. Estimated total time: 19h 4m 58s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 9s, 500 more iterations: 3h 10m 49s. [2025-11-13 09:32:06,771][__main__][INFO] - Starting iteration 244. [2025-11-13 09:32:06,774][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 24 and human policies 1. 
[2025-11-13 09:32:06,775][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:32:14,708][__main__][INFO] - Number of regex retries in iteration 244: 0 [2025-11-13 09:32:14,708][__main__][INFO] - agents played in iteration 244 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 09:32:15,159][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:32:15,191][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:32:15,223][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:32:15,256][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:32:15,256][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:32:15,257][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:32:15,925][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:32:16,220][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:32:16,543][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:32:16,868][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:32:17,191][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:32:17,515][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:32:17,838][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:32:18,162][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:32:18,485][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:32:18,809][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:32:19,133][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:32:19,456][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:32:19,779][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:32:20,103][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:32:20,430][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:32:20,754][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:32:21,080][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:32:21,406][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:32:21,733][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:32:22,063][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:32:22,393][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:32:22,719][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:32:23,043][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:32:23,367][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:32:23,691][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:32:24,015][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:32:24,340][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:32:24,664][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:32:24,989][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:32:25,314][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:32:25,638][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:32:25,964][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:32:26,288][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:32:27,002][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:32:27,716][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:32:27,717][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:32:27,719][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:32:28,607][__main__][INFO] - Iteration 245 took 21s (36.34% Gen, 59.59% Train). Generation: 7s, Training: 13s. Estimated remaining time: 16h 42m 59s. Estimated total time: 18h 11m 39s. Time estimates for 10 more iterations: 3m 38s, 100 more iterations: 36m 23s, 500 more iterations: 3h 1m 56s. [2025-11-13 09:32:28,608][__main__][INFO] - Starting iteration 245. [2025-11-13 09:32:28,612][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 24 and human policies 1. 
[2025-11-13 09:32:28,612][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:32:37,161][__main__][INFO] - Number of regex retries in iteration 245: 0 [2025-11-13 09:32:37,161][__main__][INFO] - agents played in iteration 245 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 09:32:37,606][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:32:37,638][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:32:37,671][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:32:37,704][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:32:37,704][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:32:37,705][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:32:38,376][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:32:38,670][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:32:38,996][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:32:39,323][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:32:39,647][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:32:39,975][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:32:40,297][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:32:40,619][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:32:40,943][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:32:41,266][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:32:41,590][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:32:41,915][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:32:42,239][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:32:42,562][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:32:42,886][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:32:43,209][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:32:43,536][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:32:43,864][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:32:44,193][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:32:44,518][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:32:44,846][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:32:45,173][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:32:45,500][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:32:45,831][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:32:46,156][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:32:46,481][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:32:46,806][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:32:47,129][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:32:47,453][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:32:47,777][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:32:48,101][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:32:48,427][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:32:48,751][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:32:49,481][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:32:50,190][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:32:50,191][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:32:50,193][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:32:51,194][__main__][INFO] - Iteration 246 took 22s (37.85% Gen, 57.71% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 20m 8s. Estimated total time: 18h 49m 10s. Time estimates for 10 more iterations: 3m 45s, 100 more iterations: 37m 38s, 500 more iterations: 3h 8m 11s. [2025-11-13 09:32:51,196][__main__][INFO] - Starting iteration 246. [2025-11-13 09:32:51,199][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 24 and human policies 1. 
[2025-11-13 09:32:51,200][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:32:59,824][__main__][INFO] - Number of regex retries in iteration 246: 0 [2025-11-13 09:32:59,824][__main__][INFO] - agents played in iteration 246 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 09:33:00,274][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:33:00,309][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:33:00,342][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:33:00,375][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:33:00,375][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:33:00,376][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:33:01,046][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:33:01,341][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:33:01,664][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:33:01,987][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:33:02,311][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:33:02,639][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:33:02,963][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:33:03,287][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:33:03,614][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:33:03,944][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:33:04,268][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:33:04,593][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:33:04,916][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:33:05,239][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:33:05,563][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:33:05,889][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:33:06,214][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:33:06,540][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:33:06,864][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:33:07,190][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 09:33:07,514][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 09:33:07,838][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:33:08,169][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:33:08,498][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:33:08,824][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:33:09,148][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:33:09,473][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:33:09,798][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:33:10,124][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:33:10,449][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:33:10,773][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:33:11,096][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:33:11,421][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:33:12,142][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:33:12,845][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:33:12,846][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:33:12,848][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:33:13,841][__main__][INFO] - Iteration 247 took 22s (38.09% Gen, 57.52% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 22m 43s. Estimated total time: 18h 52m 7s. Time estimates for 10 more iterations: 3m 46s, 100 more iterations: 37m 44s, 500 more iterations: 3h 8m 41s. [2025-11-13 09:33:13,843][__main__][INFO] - Starting iteration 247. [2025-11-13 09:33:13,845][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 24 and human policies 1. 
[2025-11-13 09:33:13,846][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:33:22,617][__main__][INFO] - Number of regex retries in iteration 247: 0
[2025-11-13 09:33:22,618][__main__][INFO] - agents played in iteration 247 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:33:23,064][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:33:23,099][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:33:23,132][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:33:23,164][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:33:23,165][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:33:23,165][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:33:23,843][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:33:24,136][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:33:24,461][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:33:24,787][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:33:25,111][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:33:25,434][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:33:25,759][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:33:26,081][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:33:26,405][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:33:26,727][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:33:27,050][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:33:27,373][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:33:27,696][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:33:28,019][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:33:28,343][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:33:28,666][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:33:28,989][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:33:29,313][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:33:29,638][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:33:29,963][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:33:30,287][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:33:30,612][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:33:30,937][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:33:31,262][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:33:31,587][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:33:31,912][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:33:32,237][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:33:32,562][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:33:32,886][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:33:33,211][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:33:33,536][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:33:33,860][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:33:34,185][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:33:34,903][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:33:35,609][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:33:35,611][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:33:35,612][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:33:36,617][__main__][INFO] - Iteration 248 took 22s (38.52% Gen, 57.06% Train). Generation: 8s, Training: 12s. Estimated remaining time: 17h 28m 51s. Estimated total time: 18h 58m 39s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 57s, 500 more iterations: 3h 9m 46s.
[2025-11-13 09:33:36,619][__main__][INFO] - Starting iteration 248.
[2025-11-13 09:33:36,622][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 24 and human policies 1.
[2025-11-13 09:33:36,622][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:33:45,303][__main__][INFO] - Number of regex retries in iteration 248: 0
[2025-11-13 09:33:45,304][__main__][INFO] - agents played in iteration 248 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:33:45,752][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:33:45,787][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:33:45,820][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:33:45,853][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:33:45,853][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:33:45,854][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:33:46,529][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:33:46,823][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:33:47,146][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:33:47,470][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:33:47,794][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:33:48,118][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:33:48,441][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:33:48,764][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:33:49,087][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:33:49,411][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:33:49,735][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:33:50,059][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:33:50,383][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:33:50,708][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:33:51,031][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:33:51,355][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:33:51,681][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:33:52,006][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:33:52,331][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:33:52,656][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:33:52,982][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:33:53,309][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:33:53,637][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:33:53,963][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:33:54,288][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:33:54,613][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:33:54,937][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:33:55,261][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:33:55,586][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:33:55,911][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:33:56,237][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:33:56,561][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:33:56,887][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:33:57,609][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:33:58,305][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:33:58,307][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:33:58,308][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:33:59,253][__main__][INFO] - Iteration 249 took 22s (38.36% Gen, 57.46% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 21m 25s. Estimated total time: 18h 51m 35s. Time estimates for 10 more iterations: 3m 46s, 100 more iterations: 37m 43s, 500 more iterations: 3h 8m 35s.
[2025-11-13 09:33:59,255][__main__][INFO] - Starting iteration 249.
[2025-11-13 09:33:59,257][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 24 and human policies 1.
[2025-11-13 09:33:59,258][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:34:08,220][__main__][INFO] - Number of regex retries in iteration 249: 0
[2025-11-13 09:34:08,220][__main__][INFO] - agents played in iteration 249 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:34:08,665][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:34:08,701][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:34:08,733][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:34:08,766][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:34:08,766][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:34:08,767][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:34:09,441][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:34:09,733][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:34:10,056][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:34:10,383][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:34:10,708][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:34:11,032][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:34:11,355][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:34:11,677][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:34:12,000][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:34:12,323][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:34:12,648][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:34:12,972][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:34:13,296][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:34:13,620][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:34:13,942][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:34:14,267][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:34:14,591][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:34:14,919][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:34:15,247][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:34:15,573][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:34:15,904][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:34:16,234][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:34:16,558][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:34:16,888][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:34:17,213][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:34:17,538][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:34:17,863][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:34:18,192][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:34:18,515][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:34:18,841][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:34:19,165][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:34:19,489][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:34:19,813][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:34:20,532][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:34:21,229][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:34:21,231][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:34:21,232][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:34:22,138][__main__][INFO] - Iteration 250 took 22s (39.17% Gen, 56.86% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 33m 32s. Estimated total time: 19h 4m 5s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 8s, 500 more iterations: 3h 10m 40s.
[2025-11-13 09:34:22,140][__main__][INFO] - Starting iteration 250.
[2025-11-13 09:34:22,143][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 24 and human policies 1.
[2025-11-13 09:34:22,144][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:34:31,071][__main__][INFO] - Number of regex retries in iteration 250: 0
[2025-11-13 09:34:31,071][__main__][INFO] - agents played in iteration 250 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:34:31,519][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:34:31,552][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:34:31,584][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:34:31,617][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:34:31,618][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:34:31,618][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:34:32,291][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:34:32,584][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:34:32,909][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:34:33,232][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:34:33,556][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:34:33,878][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:34:34,206][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:34:34,528][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:34:34,858][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:34:35,183][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:34:35,507][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:34:35,830][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:34:36,156][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:34:36,483][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:34:36,806][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:34:37,130][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:34:37,452][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:34:37,776][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:34:38,107][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:34:38,436][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:34:38,761][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:34:39,087][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:34:39,412][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:34:39,737][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:34:40,064][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:34:40,390][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:34:40,719][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:34:41,044][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:34:41,370][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:34:41,696][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:34:42,020][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:34:42,345][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:34:42,670][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:34:43,399][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:34:44,096][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:34:44,098][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:34:44,099][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:34:45,814][__main__][INFO] - Iteration 251 took 23s (37.71% Gen, 55.04% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 12m 37s. Estimated total time: 19h 43m 34s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 27s, 500 more iterations: 3h 17m 15s.
[2025-11-13 09:34:45,816][__main__][INFO] - Starting iteration 251.
[2025-11-13 09:34:45,819][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 25 and human policies 1.
[2025-11-13 09:34:45,819][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:34:55,584][__main__][INFO] - Number of regex retries in iteration 251: 0
[2025-11-13 09:34:55,585][__main__][INFO] - agents played in iteration 251 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:34:56,032][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:34:56,065][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:34:56,098][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:34:56,130][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:34:56,131][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:34:56,132][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:34:56,803][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:34:57,096][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:34:57,420][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:34:57,743][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:34:58,071][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:34:58,395][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:34:58,720][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:34:59,042][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:34:59,365][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:34:59,688][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:35:00,013][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:35:00,335][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:35:00,658][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:35:00,988][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:35:01,315][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:35:01,640][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:35:01,966][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:35:02,293][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:35:02,618][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:35:02,946][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:35:03,271][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:35:03,597][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:35:03,922][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:35:04,252][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:35:04,581][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:35:04,912][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:35:05,237][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:35:05,563][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:35:05,888][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:35:06,212][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:35:06,538][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:35:06,862][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:35:07,186][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:35:07,907][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:35:08,614][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:35:08,616][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:35:08,617][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:35:09,795][__main__][INFO] - Iteration 252 took 23s (40.73% Gen, 54.35% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 27m 31s. Estimated total time: 19h 58m 52s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 57s, 500 more iterations: 3h 19m 48s.
[2025-11-13 09:35:09,797][__main__][INFO] - Starting iteration 252.
[2025-11-13 09:35:09,800][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 25 and human policies 1.
[2025-11-13 09:35:09,800][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:35:18,661][__main__][INFO] - Number of regex retries in iteration 252: 0
[2025-11-13 09:35:18,662][__main__][INFO] - agents played in iteration 252 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:35:19,137][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:35:19,170][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:35:19,202][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:35:19,235][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:35:19,235][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:35:19,236][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:35:19,917][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:35:20,211][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:35:20,536][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:35:20,859][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:35:21,184][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:35:21,511][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:35:21,836][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:35:22,159][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:35:22,482][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:35:22,805][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:35:23,128][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:35:23,452][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:35:23,775][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:35:24,099][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:35:24,424][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:35:24,751][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:35:25,074][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:35:25,398][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:35:25,723][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:35:26,049][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:35:26,377][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:35:26,707][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:35:27,037][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:35:27,367][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:35:27,694][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:35:28,020][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:35:28,349][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:35:28,672][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:35:28,998][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:35:29,322][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:35:29,646][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:35:29,972][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:35:30,296][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:35:31,015][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:35:31,701][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:35:31,708][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:35:31,709][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:35:32,594][__main__][INFO] - Iteration 253 took 22s (38.87% Gen, 57.24% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 28m 1s. Estimated total time: 18h 59m 44s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 59s, 500 more iterations: 3h 9m 57s. [2025-11-13 09:35:32,596][__main__][INFO] - Starting iteration 253. [2025-11-13 09:35:32,598][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 25 and human policies 1. 
[2025-11-13 09:35:32,599][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:35:42,425][__main__][INFO] - Number of regex retries in iteration 253: 0 [2025-11-13 09:35:42,426][__main__][INFO] - agents played in iteration 253 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 09:35:42,883][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:35:42,916][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:35:42,950][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:35:42,983][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:35:42,984][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:35:42,984][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:35:43,665][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:35:43,959][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:35:44,283][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:35:44,606][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:35:44,935][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:35:45,265][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:35:45,594][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:35:45,917][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:35:46,240][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:35:46,565][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:35:46,888][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:35:47,211][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:35:47,534][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:35:47,861][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:35:48,185][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:35:48,512][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:35:48,838][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:35:49,165][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:35:49,490][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:35:49,814][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:35:50,138][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:35:50,464][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:35:50,789][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:35:51,120][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:35:51,450][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:35:51,775][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:35:52,100][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:35:52,427][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:35:52,750][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:35:53,074][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:35:53,400][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:35:53,726][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:35:54,051][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:35:54,785][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:35:55,464][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:35:55,465][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:35:55,466][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:35:56,544][__main__][INFO] - Iteration 254 took 23s (41.04% Gen, 54.46% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 25m 13s. Estimated total time: 19h 57m 21s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 54s, 500 more iterations: 3h 19m 33s. [2025-11-13 09:35:56,546][__main__][INFO] - Starting iteration 254. [2025-11-13 09:35:56,549][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 25 and human policies 1. 
[2025-11-13 09:35:56,549][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:36:05,953][__main__][INFO] - Number of regex retries in iteration 254: 0 [2025-11-13 09:36:05,953][__main__][INFO] - agents played in iteration 254 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 09:36:06,410][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:36:06,444][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:36:06,477][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:36:06,510][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:36:06,511][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:36:06,511][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:36:07,189][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:36:07,482][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:36:07,806][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:36:08,130][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:36:08,461][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:36:08,785][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:36:09,109][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:36:09,438][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:36:09,763][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:36:10,087][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:36:10,414][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:36:10,736][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:36:11,060][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:36:11,383][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:36:11,706][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:36:12,032][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:36:12,359][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:36:12,692][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:36:13,020][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:36:13,345][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:36:13,668][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:36:13,993][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:36:14,317][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:36:14,642][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:36:14,968][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:36:15,293][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:36:15,617][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:36:15,942][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:36:16,266][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:36:16,590][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:36:16,913][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:36:17,236][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:36:17,561][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:36:18,286][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:36:18,972][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:36:18,973][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:36:18,975][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:36:19,869][__main__][INFO] - Iteration 255 took 23s (40.32% Gen, 55.84% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 53m 32s. Estimated total time: 19h 26m 3s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 52s, 500 more iterations: 3h 14m 20s. [2025-11-13 09:36:19,871][__main__][INFO] - Starting iteration 255. [2025-11-13 09:36:19,874][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 25 and human policies 1. 
[2025-11-13 09:36:19,874][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:36:29,388][__main__][INFO] - Number of regex retries in iteration 255: 0 [2025-11-13 09:36:29,389][__main__][INFO] - agents played in iteration 255 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 09:36:29,856][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:36:29,891][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:36:29,924][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:36:29,956][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:36:29,957][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:36:29,958][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:36:30,629][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:36:30,922][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:36:31,248][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:36:31,573][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:36:31,897][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:36:32,222][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:36:32,545][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:36:32,868][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:36:33,191][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:36:33,513][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:36:33,836][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:36:34,159][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:36:34,483][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:36:34,807][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:36:35,131][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:36:35,453][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:36:35,778][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:36:36,101][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:36:36,425][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:36:36,751][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:36:37,075][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:36:37,403][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:36:37,728][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:36:38,053][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:36:38,377][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:36:38,703][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:36:39,028][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:36:39,352][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:36:39,676][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:36:40,001][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:36:40,326][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:36:40,652][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:36:40,977][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:36:41,697][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:36:42,376][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:36:42,378][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:36:42,379][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:36:43,278][__main__][INFO] - Iteration 256 took 23s (40.65% Gen, 55.50% Train). Generation: 9s, Training: 12s. Estimated remaining time: 17h 57m 23s. Estimated total time: 19h 30m 17s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 0s, 500 more iterations: 3h 15m 2s. [2025-11-13 09:36:43,281][__main__][INFO] - Starting iteration 256. [2025-11-13 09:36:43,283][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 25 and human policies 1. 
[2025-11-13 09:36:43,284][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:36:52,442][__main__][INFO] - Number of regex retries in iteration 256: 0 [2025-11-13 09:36:52,442][__main__][INFO] - agents played in iteration 256 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 09:36:52,908][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:36:52,942][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:36:52,976][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:36:53,010][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:36:53,010][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:36:53,011][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:36:53,703][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:36:53,996][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:36:54,320][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:36:54,645][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:36:54,970][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:36:55,294][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:36:55,619][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:36:55,942][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:36:56,266][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:36:56,589][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:36:56,912][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:36:57,236][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:36:57,562][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:36:57,884][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:36:58,210][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:36:58,534][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:36:58,857][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:36:59,182][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:36:59,507][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:36:59,832][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:37:00,156][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:37:00,481][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:37:00,810][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:37:01,134][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:37:01,459][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:37:01,785][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:37:02,108][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:37:02,431][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:37:02,755][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:37:03,081][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:37:03,407][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:37:03,731][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:37:04,056][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:37:04,776][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:37:05,473][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:37:05,475][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:37:05,477][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:37:06,385][__main__][INFO] - Iteration 257 took 23s (39.64% Gen, 56.42% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 41m 50s. Estimated total time: 19h 15m 7s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 30s, 500 more iterations: 3h 12m 31s. [2025-11-13 09:37:06,387][__main__][INFO] - Starting iteration 257. [2025-11-13 09:37:06,390][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 25 and human policies 1. 
[2025-11-13 09:37:06,390][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:37:15,478][__main__][INFO] - Number of regex retries in iteration 257: 0 [2025-11-13 09:37:15,479][__main__][INFO] - agents played in iteration 257 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 09:37:15,927][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:37:15,963][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:37:15,997][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:37:16,030][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:37:16,031][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:37:16,031][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:37:16,733][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:37:17,026][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:37:17,350][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:37:17,674][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:37:18,001][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:37:18,324][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:37:18,647][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:37:18,972][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:37:19,295][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:37:19,619][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:37:19,944][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:37:20,266][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:37:20,592][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:37:20,918][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:37:21,248][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:37:21,575][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:37:21,898][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:37:22,223][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:37:22,547][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:37:22,873][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:37:23,197][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:37:23,521][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:37:23,846][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:37:24,169][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:37:24,493][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:37:24,818][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:37:25,141][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:37:25,466][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:37:25,791][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:37:26,116][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:37:26,440][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:37:26,766][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:37:27,091][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:37:27,817][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:37:28,516][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:37:28,518][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:37:28,519][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:37:29,434][__main__][INFO] - Iteration 258 took 23s (39.44% Gen, 56.58% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 38m 36s. Estimated total time: 19h 12m 16s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 24s, 500 more iterations: 3h 12m 2s. [2025-11-13 09:37:29,436][__main__][INFO] - Starting iteration 258. [2025-11-13 09:37:29,439][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 25 and human policies 1. 
[2025-11-13 09:37:29,439][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:37:38,561][__main__][INFO] - Number of regex retries in iteration 258: 0
[2025-11-13 09:37:38,562][__main__][INFO] - agents played in iteration 258 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:37:39,011][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:37:39,045][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:37:39,078][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:37:39,112][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:37:39,112][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:37:39,113][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:37:39,809][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:37:40,104][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:37:40,427][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:37:40,750][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:37:41,073][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:37:41,397][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:37:41,720][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:37:42,044][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:37:42,369][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:37:42,695][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:37:43,022][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:37:43,345][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:37:43,669][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:37:43,992][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:37:44,317][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:37:44,640][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:37:44,967][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:37:45,291][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:37:45,615][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:37:45,940][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:37:46,267][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:37:46,591][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:37:46,916][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:37:47,241][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:37:47,567][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:37:47,893][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:37:48,217][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:37:48,541][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:37:48,867][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:37:49,191][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:37:49,516][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:37:49,845][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:37:50,171][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:37:50,884][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:37:51,579][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:37:51,581][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:37:51,582][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:37:52,570][__main__][INFO] - Iteration 259 took 23s (39.44% Gen, 56.29% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 42m 32s. Estimated total time: 19h 16m 35s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 33s, 500 more iterations: 3h 12m 45s.
[2025-11-13 09:37:52,572][__main__][INFO] - Starting iteration 259.
[2025-11-13 09:37:52,575][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 25 and human policies 1.
[2025-11-13 09:37:52,575][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:38:01,960][__main__][INFO] - Number of regex retries in iteration 259: 0
[2025-11-13 09:38:01,960][__main__][INFO] - agents played in iteration 259 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:38:02,419][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:38:02,454][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:38:02,487][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:38:02,521][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:38:02,522][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:38:02,522][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:38:03,243][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:38:03,537][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:38:03,861][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:38:04,187][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:38:04,509][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:38:04,834][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:38:05,157][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:38:05,480][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:38:05,803][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:38:06,128][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:38:06,452][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:38:06,776][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:38:07,103][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:38:07,427][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:38:07,753][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:38:08,078][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:38:08,403][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:38:08,727][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:38:09,051][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:38:09,373][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:38:09,697][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:38:10,022][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:38:10,347][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:38:10,672][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:38:10,998][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:38:11,323][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:38:11,647][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:38:11,972][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:38:12,297][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:38:12,622][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:38:12,947][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:38:13,272][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:38:13,596][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:38:14,328][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:38:15,047][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:38:15,049][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:38:15,050][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:38:15,946][__main__][INFO] - Iteration 260 took 23s (40.15% Gen, 56.00% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 54m 11s. Estimated total time: 19h 28m 38s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 57s, 500 more iterations: 3h 14m 46s.
[2025-11-13 09:38:15,949][__main__][INFO] - Starting iteration 260.
[2025-11-13 09:38:15,952][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 25 and human policies 1.
[2025-11-13 09:38:15,953][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:38:25,277][__main__][INFO] - Number of regex retries in iteration 260: 0
[2025-11-13 09:38:25,278][__main__][INFO] - agents played in iteration 260 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:38:25,737][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:38:25,770][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:38:25,804][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:38:25,837][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:38:25,837][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:38:25,838][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:38:26,557][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:38:26,852][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:38:27,175][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:38:27,498][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:38:27,820][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:38:28,146][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:38:28,468][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:38:28,794][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:38:29,119][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:38:29,444][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:38:29,767][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:38:30,091][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:38:30,419][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:38:30,743][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:38:31,070][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:38:31,393][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:38:31,717][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:38:32,043][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:38:32,366][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:38:32,691][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:38:33,014][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:38:33,338][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:38:33,663][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:38:33,991][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:38:34,318][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:38:34,645][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:38:34,971][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:38:35,294][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:38:35,620][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:38:35,946][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:38:36,270][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:38:36,594][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:38:36,918][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:38:37,636][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:38:38,346][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:38:38,348][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:38:38,350][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:38:40,087][__main__][INFO] - Iteration 261 took 24s (38.64% Gen, 54.16% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 31m 57s. Estimated total time: 20h 6m 48s. Time estimates for 10 more iterations: 4m 1s, 100 more iterations: 40m 13s, 500 more iterations: 3h 21m 8s.
[2025-11-13 09:38:40,090][__main__][INFO] - Starting iteration 261.
[2025-11-13 09:38:40,092][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 26 and human policies 1.
[2025-11-13 09:38:40,093][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:38:49,634][__main__][INFO] - Number of regex retries in iteration 261: 0
[2025-11-13 09:38:49,635][__main__][INFO] - agents played in iteration 261 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:38:50,098][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:38:50,131][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:38:50,163][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:38:50,196][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:38:50,196][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:38:50,197][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:38:50,876][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:38:51,169][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:38:51,493][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:38:51,815][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:38:52,142][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:38:52,467][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:38:52,790][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:38:53,114][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:38:53,443][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:38:53,769][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:38:54,094][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:38:54,419][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:38:54,745][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:38:55,075][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:38:55,400][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:38:55,725][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:38:56,049][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:38:56,373][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:38:56,699][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:38:57,025][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:38:57,350][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:38:57,674][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:38:57,999][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:38:58,327][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:38:58,652][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:38:58,976][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:38:59,303][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:38:59,626][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:38:59,950][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:39:00,273][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:39:00,598][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:39:00,925][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:39:01,251][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:39:01,971][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:39:02,675][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:39:02,677][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:39:02,678][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:39:03,886][__main__][INFO] - Iteration 262 took 23s (40.10% Gen, 54.82% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 14m 29s. Estimated total time: 19h 49m 44s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 39s, 500 more iterations: 3h 18m 17s.
[2025-11-13 09:39:03,888][__main__][INFO] - Starting iteration 262.
[2025-11-13 09:39:03,891][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 26 and human policies 1.
[2025-11-13 09:39:03,892][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:39:13,087][__main__][INFO] - Number of regex retries in iteration 262: 0
[2025-11-13 09:39:13,087][__main__][INFO] - agents played in iteration 262 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:39:13,539][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:39:13,572][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:39:13,606][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:39:13,640][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:39:13,640][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:39:13,641][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:39:14,361][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:39:14,656][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:39:14,981][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:39:15,306][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:39:15,631][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:39:15,955][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:39:16,281][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:39:16,612][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:39:16,935][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:39:17,261][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:39:17,589][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:39:17,912][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:39:18,236][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:39:18,563][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:39:18,887][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:39:19,213][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:39:19,538][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:39:19,863][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:39:20,189][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:39:20,514][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:39:20,837][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:39:21,161][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:39:21,487][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:39:21,813][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:39:22,137][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:39:22,463][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:39:22,788][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:39:23,113][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:39:23,438][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:39:23,762][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:39:24,085][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:39:24,410][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:39:24,733][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:39:25,458][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:39:26,156][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:39:26,158][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:39:26,159][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:39:27,105][__main__][INFO] - Iteration 263 took 23s (39.61% Gen, 56.31% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 45m 8s. Estimated total time: 19h 20m 46s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 41s, 500 more iterations: 3h 13m 27s.
[2025-11-13 09:39:27,108][__main__][INFO] - Starting iteration 263.
[2025-11-13 09:39:27,111][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 26 and human policies 1.
[2025-11-13 09:39:27,112][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:39:36,127][__main__][INFO] - Number of regex retries in iteration 263: 0
[2025-11-13 09:39:36,127][__main__][INFO] - agents played in iteration 263 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:39:36,588][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:39:36,622][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:39:36,654][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:39:36,687][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:39:36,688][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:39:36,688][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:39:37,414][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:39:37,706][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:39:38,029][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:39:38,354][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:39:38,680][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:39:39,005][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:39:39,332][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:39:39,654][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:39:39,978][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:39:40,306][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:39:40,633][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:39:40,965][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:39:41,290][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:39:41,615][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:39:41,938][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:39:42,261][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:39:42,585][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:39:42,910][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:39:43,234][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:39:43,559][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:39:43,885][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:39:44,211][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:39:44,536][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:39:44,863][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:39:45,186][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:39:45,510][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:39:45,838][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:39:46,163][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:39:46,487][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:39:46,814][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:39:47,139][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:39:47,464][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:39:47,790][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:39:48,512][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:39:49,208][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:39:49,210][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:39:49,211][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:39:50,172][__main__][INFO] - Iteration 264 took 23s (39.09% Gen, 56.74% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 37m 4s. Estimated total time: 19h 13m 5s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 26s, 500 more iterations: 3h 12m 10s. [2025-11-13 09:39:50,174][__main__][INFO] - Starting iteration 264. [2025-11-13 09:39:50,178][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 26 and human policies 1. 
[2025-11-13 09:39:50,178][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:39:59,146][__main__][INFO] - Number of regex retries in iteration 264: 0 [2025-11-13 09:39:59,146][__main__][INFO] - agents played in iteration 264 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 09:39:59,593][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:39:59,626][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:39:59,659][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:39:59,691][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:39:59,692][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:39:59,693][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:40:00,392][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:40:00,686][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:40:01,011][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:40:01,334][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:40:01,658][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:40:01,980][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:40:02,306][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:40:02,632][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:40:02,956][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:40:03,283][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:40:03,608][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:40:03,932][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:40:04,255][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:40:04,579][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:40:04,905][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:40:05,230][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:40:05,555][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:40:05,879][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:40:06,205][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:40:06,529][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 
[2025-11-13 09:40:06,854][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:40:07,178][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:40:07,506][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:40:07,834][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:40:08,161][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:40:08,485][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:40:08,810][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:40:09,133][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:40:09,456][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:40:09,782][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:40:10,110][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:40:10,433][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:40:10,756][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:40:11,460][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:40:12,165][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:40:12,166][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:40:12,167][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:40:13,175][__main__][INFO] - Iteration 265 took 22s (38.99% Gen, 56.62% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 33m 30s. Estimated total time: 19h 9m 54s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 19s, 500 more iterations: 3h 11m 39s. [2025-11-13 09:40:13,177][__main__][INFO] - Starting iteration 265. [2025-11-13 09:40:13,180][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 26 and human policies 1. 
[2025-11-13 09:40:13,181][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:40:22,327][__main__][INFO] - Number of regex retries in iteration 265: 0 [2025-11-13 09:40:22,328][__main__][INFO] - agents played in iteration 265 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 09:40:22,784][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:40:22,817][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:40:22,850][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:40:22,882][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:40:22,883][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:40:22,883][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:40:23,599][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:40:23,893][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:40:24,216][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:40:24,539][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:40:24,864][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:40:25,188][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:40:25,512][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:40:25,838][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:40:26,164][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:40:26,487][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:40:26,811][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:40:27,135][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:40:27,458][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:40:27,781][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:40:28,106][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:40:28,432][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:40:28,756][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:40:29,081][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:40:29,406][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:40:29,733][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 
[2025-11-13 09:40:30,057][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:40:30,380][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:40:30,706][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:40:31,034][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:40:31,365][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:40:31,688][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:40:32,013][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:40:32,338][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:40:32,662][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:40:32,987][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:40:33,312][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:40:33,637][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:40:33,960][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:40:34,690][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:40:35,400][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:40:35,402][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:40:35,404][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:40:36,414][__main__][INFO] - Iteration 266 took 23s (39.37% Gen, 56.28% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 44m 55s. Estimated total time: 19h 21m 42s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 43s, 500 more iterations: 3h 13m 37s. [2025-11-13 09:40:36,416][__main__][INFO] - Starting iteration 266. [2025-11-13 09:40:36,419][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 26 and human policies 1. 
[2025-11-13 09:40:36,419][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:40:45,362][__main__][INFO] - Number of regex retries in iteration 266: 0 [2025-11-13 09:40:45,363][__main__][INFO] - agents played in iteration 266 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 09:40:45,821][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:40:45,856][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:40:45,889][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:40:45,922][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:40:45,922][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:40:45,923][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:40:46,641][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:40:46,935][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:40:47,261][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:40:47,585][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:40:47,910][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:40:48,234][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:40:48,557][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:40:48,886][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:40:49,213][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:40:49,538][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:40:49,863][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:40:50,187][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:40:50,510][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:40:50,833][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:40:51,161][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:40:51,486][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:40:51,811][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:40:52,134][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:40:52,459][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:40:52,783][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 
[2025-11-13 09:40:53,110][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:40:53,437][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:40:53,767][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:40:54,091][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:40:54,415][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:40:54,738][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:40:55,064][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:40:55,387][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:40:55,713][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:40:56,037][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:40:56,361][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:40:56,686][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:40:57,012][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:40:57,726][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:40:58,432][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:40:58,434][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:40:58,436][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:40:59,456][__main__][INFO] - Iteration 267 took 23s (38.82% Gen, 56.75% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 34m 43s. Estimated total time: 19h 11m 53s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 23s, 500 more iterations: 3h 11m 58s. [2025-11-13 09:40:59,458][__main__][INFO] - Starting iteration 267. [2025-11-13 09:40:59,461][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 26 and human policies 1. 
[2025-11-13 09:40:59,461][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:41:08,205][__main__][INFO] - Number of regex retries in iteration 267: 0 [2025-11-13 09:41:08,205][__main__][INFO] - agents played in iteration 267 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 09:41:08,659][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:41:08,694][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:41:08,728][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:41:08,761][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:41:08,762][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:41:08,762][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:41:09,471][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:41:09,765][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:41:10,089][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:41:10,413][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:41:10,738][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:41:11,065][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:41:11,390][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:41:11,717][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:41:12,043][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:41:12,367][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:41:12,692][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:41:13,018][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:41:13,342][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:41:13,666][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:41:13,991][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:41:14,314][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:41:14,641][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:41:14,968][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:41:15,293][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:41:15,618][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 
[2025-11-13 09:41:15,942][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:41:16,267][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:41:16,591][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:41:16,916][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:41:17,242][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:41:17,569][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:41:17,893][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:41:18,218][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:41:18,543][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:41:18,867][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:41:19,194][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:41:19,517][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:41:19,840][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:41:20,562][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:41:21,267][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:41:21,269][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:41:21,271][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:41:22,194][__main__][INFO] - Iteration 268 took 22s (38.46% Gen, 57.47% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 19m 10s. Estimated total time: 18h 56m 43s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 53s, 500 more iterations: 3h 9m 27s. [2025-11-13 09:41:22,196][__main__][INFO] - Starting iteration 268. [2025-11-13 09:41:22,199][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 26 and human policies 1. 
[2025-11-13 09:41:22,200][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:41:31,564][__main__][INFO] - Number of regex retries in iteration 268: 0 [2025-11-13 09:41:31,564][__main__][INFO] - agents played in iteration 268 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 09:41:32,025][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:41:32,058][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:41:32,091][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:41:32,125][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:41:32,125][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:41:32,126][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:41:32,851][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 09:41:33,147][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 09:41:33,472][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 09:41:33,795][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 09:41:34,119][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 09:41:34,446][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 09:41:34,772][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 09:41:35,097][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 09:41:35,422][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 09:41:35,749][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 09:41:36,073][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 09:41:36,396][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 09:41:36,719][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 09:41:37,043][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 09:41:37,367][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 09:41:37,690][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 09:41:38,015][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 09:41:38,339][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 09:41:38,665][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 09:41:38,990][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 
[2025-11-13 09:41:39,319][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 09:41:39,646][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 09:41:39,973][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 09:41:40,304][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 09:41:40,630][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 09:41:40,954][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 09:41:41,278][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 09:41:41,602][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 09:41:41,924][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 09:41:42,250][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 09:41:42,573][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 09:41:42,899][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 09:41:43,225][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 09:41:43,956][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 09:41:44,658][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:41:44,660][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:41:44,661][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:41:45,605][__main__][INFO] - Iteration 269 took 23s (40.01% Gen, 55.96% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 52m 22s. Estimated total time: 19h 30m 19s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 0s, 500 more iterations: 3h 15m 3s. [2025-11-13 09:41:45,607][__main__][INFO] - Starting iteration 269. [2025-11-13 09:41:45,610][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 26 and human policies 1. 
[2025-11-13 09:41:45,610][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:41:54,563][__main__][INFO] - Number of regex retries in iteration 269: 0 [2025-11-13 09:41:54,564][__main__][INFO] - agents played in iteration 269 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 09:41:55,023][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:41:55,057][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:41:55,090][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:41:55,124][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:41:55,125][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:41:55,125][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:41:55,849][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:41:56,145][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:41:56,473][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:41:56,799][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:41:57,128][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:41:57,453][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:41:57,776][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:41:58,100][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:41:58,424][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:41:58,749][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:41:59,074][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:41:59,401][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:41:59,728][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:42:00,053][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:42:00,377][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:42:00,702][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:42:01,027][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:42:01,350][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:42:01,679][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:42:02,006][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:42:02,331][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:42:02,654][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:42:02,985][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:42:03,314][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:42:03,638][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:42:03,963][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:42:04,288][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:42:04,613][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:42:04,937][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:42:05,262][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:42:05,588][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:42:05,913][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:42:06,237][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:42:06,960][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:42:07,675][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:42:07,676][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:42:07,678][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:42:08,734][__main__][INFO] - Iteration 270 took 23s (38.72% Gen, 56.71% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 37m 53s. Estimated total time: 19h 16m 13s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 32s, 500 more iterations: 3h 12m 42s.
[2025-11-13 09:42:08,736][__main__][INFO] - Starting iteration 270.
[2025-11-13 09:42:08,739][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 26 and human policies 1.
[2025-11-13 09:42:08,739][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:42:17,877][__main__][INFO] - Number of regex retries in iteration 270: 0
[2025-11-13 09:42:17,877][__main__][INFO] - agents played in iteration 270 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:42:18,342][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:42:18,375][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:42:18,408][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:42:18,440][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:42:18,441][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:42:18,442][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:42:19,162][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:42:19,457][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:42:19,782][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:42:20,109][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:42:20,433][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:42:20,756][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:42:21,080][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:42:21,405][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:42:21,729][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:42:22,052][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:42:22,376][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:42:22,700][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:42:23,027][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:42:23,352][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:42:23,675][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:42:23,999][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:42:24,323][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:42:24,646][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:42:24,970][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:42:25,296][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:42:25,622][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:42:25,947][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:42:26,271][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:42:26,596][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:42:26,919][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:42:27,244][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:42:27,569][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:42:27,894][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:42:28,218][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:42:28,545][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:42:28,869][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:42:29,194][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:42:29,519][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:42:30,243][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:42:30,979][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:42:30,980][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:42:30,982][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:42:33,230][__main__][INFO] - Iteration 271 took 24s (37.31% Gen, 53.50% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 45m 51s. Estimated total time: 20h 24m 36s. Time estimates for 10 more iterations: 4m 4s, 100 more iterations: 40m 49s, 500 more iterations: 3h 24m 6s.
[2025-11-13 09:42:33,232][__main__][INFO] - Starting iteration 271.
[2025-11-13 09:42:33,234][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 27 and human policies 1.
[2025-11-13 09:42:33,235][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:42:42,401][__main__][INFO] - Number of regex retries in iteration 271: 0
[2025-11-13 09:42:42,402][__main__][INFO] - agents played in iteration 271 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:42:42,858][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:42:42,894][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:42:42,927][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:42:42,961][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:42:42,961][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:42:42,961][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:42:43,660][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:42:43,954][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:42:44,279][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:42:44,604][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:42:44,928][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:42:45,252][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:42:45,576][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:42:45,900][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:42:46,225][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:42:46,550][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:42:46,876][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:42:47,200][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:42:47,525][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:42:47,849][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:42:48,174][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:42:48,498][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:42:48,821][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:42:49,145][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:42:49,471][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:42:49,797][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:42:50,120][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:42:50,445][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:42:50,772][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:42:51,094][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:42:51,421][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:42:51,747][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:42:52,071][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:42:52,397][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:42:52,722][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:42:53,048][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:42:53,374][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:42:53,697][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:42:54,023][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:42:54,733][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:42:55,457][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:42:55,459][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:42:55,460][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:42:56,490][__main__][INFO] - Iteration 272 took 23s (39.42% Gen, 56.15% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 43m 42s. Estimated total time: 19h 22m 49s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 45s, 500 more iterations: 3h 13m 48s.
[2025-11-13 09:42:56,492][__main__][INFO] - Starting iteration 272.
[2025-11-13 09:42:56,495][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 27 and human policies 1.
[2025-11-13 09:42:56,497][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:43:05,788][__main__][INFO] - Number of regex retries in iteration 272: 0
[2025-11-13 09:43:05,789][__main__][INFO] - agents played in iteration 272 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:43:06,260][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:43:06,295][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:43:06,329][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:43:06,363][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:43:06,364][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:43:06,364][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:43:07,083][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:43:07,378][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:43:07,703][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:43:08,030][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:43:08,353][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:43:08,676][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:43:09,000][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:43:09,324][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:43:09,649][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:43:09,977][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:43:10,301][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:43:10,625][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:43:10,948][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:43:11,272][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:43:11,596][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:43:11,920][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:43:12,244][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:43:12,568][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:43:12,893][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:43:13,216][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:43:13,540][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:43:13,866][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:43:14,190][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:43:14,515][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:43:14,841][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:43:15,166][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:43:15,490][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:43:15,813][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:43:16,137][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:43:16,461][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:43:16,787][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:43:17,112][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:43:17,435][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:43:18,133][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:43:18,848][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:43:18,849][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:43:18,851][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:43:19,843][__main__][INFO] - Iteration 273 took 23s (39.80% Gen, 55.94% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 47m 54s. Estimated total time: 19h 27m 25s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 54s, 500 more iterations: 3h 14m 34s.
[2025-11-13 09:43:19,845][__main__][INFO] - Starting iteration 273.
[2025-11-13 09:43:19,848][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 27 and human policies 1.
[2025-11-13 09:43:19,849][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:43:29,357][__main__][INFO] - Number of regex retries in iteration 273: 0
[2025-11-13 09:43:29,357][__main__][INFO] - agents played in iteration 273 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:43:29,819][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:43:29,855][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:43:29,888][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:43:29,922][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:43:29,923][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:43:29,923][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:43:30,666][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:43:30,961][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:43:31,289][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:43:31,613][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:43:31,939][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:43:32,263][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:43:32,589][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:43:32,914][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:43:33,240][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:43:33,564][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:43:33,889][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:43:34,214][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:43:34,538][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:43:34,862][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:43:35,187][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:43:35,511][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:43:35,835][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:43:36,161][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:43:36,487][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:43:36,813][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:43:37,138][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:43:37,463][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:43:37,789][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:43:38,113][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:43:38,437][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:43:38,766][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:43:39,090][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:43:39,415][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:43:39,740][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:43:40,065][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:43:40,388][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:43:40,714][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:43:41,039][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:43:41,759][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:43:42,495][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:43:42,497][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:43:42,499][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:43:43,547][__main__][INFO] - Iteration 274 took 23s (40.12% Gen, 55.45% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 5m 5s. Estimated total time: 19h 44m 59s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 29s, 500 more iterations: 3h 17m 29s.
[2025-11-13 09:43:43,549][__main__][INFO] - Starting iteration 274.
[2025-11-13 09:43:43,552][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 27 and human policies 1.
[2025-11-13 09:43:43,553][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:43:52,216][__main__][INFO] - Number of regex retries in iteration 274: 0
[2025-11-13 09:43:52,217][__main__][INFO] - agents played in iteration 274 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:43:52,692][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:43:52,728][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:43:52,762][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:43:52,796][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:43:52,796][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:43:52,797][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:43:53,537][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:43:53,831][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:43:54,157][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:43:54,480][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:43:54,804][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:43:55,129][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:43:55,453][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:43:55,779][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:43:56,103][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:43:56,426][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:43:56,751][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:43:57,075][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:43:57,400][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:43:57,723][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:43:58,047][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:43:58,372][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:43:58,698][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:43:59,021][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:43:59,348][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:43:59,674][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:43:59,998][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:44:00,322][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:44:00,646][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:44:00,970][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:44:01,295][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:44:01,620][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:44:01,946][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:44:02,270][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:44:02,594][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:44:02,916][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:44:03,242][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:44:03,567][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:44:03,893][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:44:04,607][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:44:05,347][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:44:05,348][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:44:05,350][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:44:06,421][__main__][INFO] - Iteration 275 took 22s (37.88% Gen, 57.43% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 23m 10s. Estimated total time: 19h 3m 27s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 6s, 500 more iterations: 3h 10m 34s.
[2025-11-13 09:44:06,423][__main__][INFO] - Starting iteration 275.
[2025-11-13 09:44:06,426][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 27 and human policies 1.
[2025-11-13 09:44:06,426][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:44:15,122][__main__][INFO] - Number of regex retries in iteration 275: 0
[2025-11-13 09:44:15,123][__main__][INFO] - agents played in iteration 275 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:44:15,575][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:44:15,609][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:44:15,643][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:44:15,677][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:44:15,677][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:44:15,678][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:44:16,407][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:44:16,701][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:44:17,026][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:44:17,349][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:44:17,674][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:44:17,999][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:44:18,325][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:44:18,648][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:44:18,974][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:44:19,299][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:44:19,623][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:44:19,949][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:44:20,278][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:44:20,604][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:44:20,930][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:44:21,254][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:44:21,579][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:44:21,904][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:44:22,230][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:44:22,554][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:44:22,878][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:44:23,201][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:44:23,526][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:44:23,850][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:44:24,174][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:44:24,503][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:44:24,830][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:44:25,154][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:44:25,479][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:44:25,803][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:44:26,127][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:44:26,453][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:44:26,776][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:44:27,463][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:44:28,184][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:44:28,186][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:44:28,187][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:44:29,151][__main__][INFO] - Iteration 276 took 22s (38.27% Gen, 57.49% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 15m 39s. Estimated total time: 18h 56m 19s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 52s, 500 more iterations: 3h 9m 23s.
[2025-11-13 09:44:29,154][__main__][INFO] - Starting iteration 276.
[2025-11-13 09:44:29,157][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 27 and human policies 1.
[2025-11-13 09:44:29,158][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:44:38,692][__main__][INFO] - Number of regex retries in iteration 276: 0
[2025-11-13 09:44:38,692][__main__][INFO] - agents played in iteration 276 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:44:39,158][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:44:39,191][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:44:39,224][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:44:39,257][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:44:39,257][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:44:39,258][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:44:39,986][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:44:40,281][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:44:40,607][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:44:40,931][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:44:41,257][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:44:41,581][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:44:41,907][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:44:42,233][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:44:42,557][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:44:42,881][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:44:43,206][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:44:43,530][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:44:43,855][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:44:44,181][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:44:44,506][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:44:44,832][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:44:45,158][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:44:45,482][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:44:45,806][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:44:46,131][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:44:46,454][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:44:46,780][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:44:47,104][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:44:47,429][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:44:47,752][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:44:48,075][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:44:48,401][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:44:48,725][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:44:49,048][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:44:49,371][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:44:49,696][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:44:50,020][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:44:50,346][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:44:51,029][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:44:51,755][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:44:51,757][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:44:51,759][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:44:52,788][__main__][INFO] - Iteration 277 took 23s (40.35% Gen, 55.29% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 0m 30s. Estimated total time: 19h 41m 33s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 23s, 500 more iterations: 3h 16m 55s.
[2025-11-13 09:44:52,790][__main__][INFO] - Starting iteration 277.
[2025-11-13 09:44:52,793][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 27 and human policies 1.
[2025-11-13 09:44:52,794][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:45:02,203][__main__][INFO] - Number of regex retries in iteration 277: 0
[2025-11-13 09:45:02,204][__main__][INFO] - agents played in iteration 277 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:45:02,667][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:45:02,700][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:45:02,733][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:45:02,767][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:45:02,768][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:45:02,768][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:45:03,495][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:45:03,790][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:45:04,115][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:45:04,441][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:45:04,765][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:45:05,089][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:45:05,413][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:45:05,740][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:45:06,064][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:45:06,388][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:45:06,715][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:45:07,040][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:45:07,365][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:45:07,689][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:45:08,013][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:45:08,337][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:45:08,661][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:45:08,985][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:45:09,308][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:45:09,632][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:45:09,959][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:45:10,283][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:45:10,607][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:45:10,933][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:45:11,262][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:45:11,585][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:45:11,908][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:45:12,238][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:45:12,560][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:45:12,885][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:45:13,215][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:45:13,543][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:45:13,869][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:45:14,552][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:45:15,267][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:45:15,269][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:45:15,271][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:45:16,309][__main__][INFO] - Iteration 278 took 23s (40.01% Gen, 55.56% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 54m 24s. Estimated total time: 19h 35m 51s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 11s, 500 more iterations: 3h 15m 58s.
[2025-11-13 09:45:16,311][__main__][INFO] - Starting iteration 278.
[2025-11-13 09:45:16,314][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 27 and human policies 1.
[2025-11-13 09:45:16,315][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:45:25,742][__main__][INFO] - Number of regex retries in iteration 278: 0
[2025-11-13 09:45:25,743][__main__][INFO] - agents played in iteration 278 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:45:26,201][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:45:26,234][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:45:26,267][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:45:26,301][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:45:26,302][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:45:26,302][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:45:27,036][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:45:27,330][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:45:27,656][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:45:27,983][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:45:28,307][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:45:28,631][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:45:28,955][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:45:29,278][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:45:29,602][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:45:29,927][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:45:30,252][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:45:30,576][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:45:30,902][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:45:31,225][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:45:31,549][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:45:31,873][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:45:32,198][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:45:32,523][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:45:32,851][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:45:33,176][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:45:33,500][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:45:33,825][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:45:34,149][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:45:34,474][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:45:34,798][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:45:35,127][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:45:35,452][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:45:35,776][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:45:36,103][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:45:36,429][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:45:36,754][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:45:37,079][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:45:37,408][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:45:38,100][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:45:38,840][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:45:38,843][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:45:38,845][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:45:39,921][__main__][INFO] - Iteration 279 took 23s (39.94% Gen, 55.50% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 58m 30s. Estimated total time: 19h 40m 21s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 20s, 500 more iterations: 3h 16m 43s.
[2025-11-13 09:45:39,923][__main__][INFO] - Starting iteration 279.
[2025-11-13 09:45:39,927][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 27 and human policies 1.
[2025-11-13 09:45:39,927][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:45:49,199][__main__][INFO] - Number of regex retries in iteration 279: 0
[2025-11-13 09:45:49,200][__main__][INFO] - agents played in iteration 279 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:45:49,660][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:45:49,693][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:45:49,727][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:45:49,760][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:45:49,761][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:45:49,762][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:45:50,512][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:45:50,807][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:45:51,133][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:45:51,457][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:45:51,783][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:45:52,108][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:45:52,436][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:45:52,765][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:45:53,088][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:45:53,413][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:45:53,737][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:45:54,060][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:45:54,384][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:45:54,712][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:45:55,040][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:45:55,364][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:45:55,690][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:45:56,013][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:45:56,337][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:45:56,662][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:45:56,986][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:45:57,312][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:45:57,635][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:45:57,960][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:45:58,285][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:45:58,613][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:45:58,938][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:45:59,262][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:45:59,586][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:45:59,910][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:46:00,234][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:46:00,558][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:46:00,882][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:46:01,583][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:46:02,319][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:46:02,321][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:46:02,322][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:46:03,346][__main__][INFO] - Iteration 280 took 23s (39.59% Gen, 56.03% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 48m 47s. Estimated total time: 19h 31m 1s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 2s, 500 more iterations: 3h 15m 10s.
[2025-11-13 09:46:03,348][__main__][INFO] - Starting iteration 280.
[2025-11-13 09:46:03,352][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 27 and human policies 1.
[2025-11-13 09:46:03,352][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:46:12,829][__main__][INFO] - Number of regex retries in iteration 280: 0
[2025-11-13 09:46:12,829][__main__][INFO] - agents played in iteration 280 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:46:13,303][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:46:13,336][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:46:13,370][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:46:13,403][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:46:13,404][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:46:13,405][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:46:14,141][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:46:14,437][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:46:14,763][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:46:15,089][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:46:15,415][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:46:15,740][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:46:16,066][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:46:16,390][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:46:16,715][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:46:17,039][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:46:17,364][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:46:17,687][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:46:18,014][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:46:18,339][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:46:18,663][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:46:18,987][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:46:19,312][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:46:19,641][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:46:19,965][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:46:20,291][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:46:20,616][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:46:20,941][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:46:21,265][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:46:21,590][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:46:21,916][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:46:22,240][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:46:22,568][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:46:22,892][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:46:23,216][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:46:23,538][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:46:23,861][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:46:24,186][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:46:24,510][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:46:25,225][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:46:25,959][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:46:25,961][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:46:25,963][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:46:27,954][__main__][INFO] - Iteration 281 took 24s (38.52% Gen, 53.38% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 47m 30s. Estimated total time: 20h 30m 9s. Time estimates for 10 more iterations: 4m 6s, 100 more iterations: 41m 0s, 500 more iterations: 3h 25m 1s.
[2025-11-13 09:46:27,956][__main__][INFO] - Starting iteration 281.
[2025-11-13 09:46:27,986][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 28 and human policies 1.
[2025-11-13 09:46:27,986][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:46:37,749][__main__][INFO] - Number of regex retries in iteration 281: 0
[2025-11-13 09:46:37,750][__main__][INFO] - agents played in iteration 281 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:46:38,220][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:46:38,253][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:46:38,286][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:46:38,319][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:46:38,320][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:46:38,321][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:46:39,065][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:46:39,359][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:46:39,684][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:46:40,008][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:46:40,333][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:46:40,658][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:46:40,989][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:46:41,313][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:46:41,638][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:46:41,966][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:46:42,289][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:46:42,612][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:46:42,937][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:46:43,260][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:46:43,586][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:46:43,911][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:46:44,237][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:46:44,562][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:46:44,887][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:46:45,211][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:46:45,535][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:46:45,861][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:46:46,185][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:46:46,510][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:46:46,836][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:46:47,161][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:46:47,491][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:46:47,816][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:46:48,140][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:46:48,465][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:46:48,788][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:46:49,113][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:46:49,438][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:46:50,145][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:46:50,877][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:46:50,878][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:46:50,880][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:46:51,840][__main__][INFO] - Iteration 282 took 23s (40.88% Gen, 54.98% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 11m 4s. Estimated total time: 19h 54m 7s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 48s, 500 more iterations: 3h 19m 1s.
[2025-11-13 09:46:51,842][__main__][INFO] - Starting iteration 282.
[2025-11-13 09:46:51,845][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 28 and human policies 1.
[2025-11-13 09:46:51,846][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:47:01,549][__main__][INFO] - Number of regex retries in iteration 282: 0
[2025-11-13 09:47:01,550][__main__][INFO] - agents played in iteration 282 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:47:02,020][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:47:02,054][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:47:02,087][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:47:02,120][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:47:02,121][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:47:02,122][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:47:02,853][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:47:03,148][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:47:03,471][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:47:03,795][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:47:04,120][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:47:04,447][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:47:04,771][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:47:05,095][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:47:05,421][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:47:05,746][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:47:06,071][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:47:06,398][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:47:06,724][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:47:07,047][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:47:07,372][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:47:07,695][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:47:08,021][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:47:08,344][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:47:08,668][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:47:08,992][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:47:09,317][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:47:09,641][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:47:09,966][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:47:10,289][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:47:10,612][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:47:10,938][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:47:11,261][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:47:11,586][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:47:11,910][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:47:12,234][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:47:12,557][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:47:12,881][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:47:13,206][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:47:13,928][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:47:14,645][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:47:14,646][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:47:14,648][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:47:15,630][__main__][INFO] - Iteration 283 took 23s (40.80% Gen, 55.07% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 5m 50s. Estimated total time: 19h 49m 17s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 38s, 500 more iterations: 3h 18m 12s.
[2025-11-13 09:47:15,632][__main__][INFO] - Starting iteration 283.
[2025-11-13 09:47:15,635][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 28 and human policies 1.
[2025-11-13 09:47:15,636][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:47:25,216][__main__][INFO] - Number of regex retries in iteration 283: 0
[2025-11-13 09:47:25,216][__main__][INFO] - agents played in iteration 283 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:47:25,679][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:47:25,712][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:47:25,746][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:47:25,780][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:47:25,781][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:47:25,781][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:47:26,521][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:47:26,815][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:47:27,140][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:47:27,465][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:47:27,789][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:47:28,117][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:47:28,440][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:47:28,766][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:47:29,089][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:47:29,417][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:47:29,742][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:47:30,067][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:47:30,391][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:47:30,715][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:47:31,040][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:47:31,366][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:47:31,689][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:47:32,014][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:47:32,339][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:47:32,665][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:47:32,989][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:47:33,311][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:47:33,634][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:47:33,961][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:47:34,285][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:47:34,611][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:47:34,937][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:47:35,262][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:47:35,588][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:47:35,915][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:47:36,239][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:47:36,566][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:47:36,895][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:47:37,592][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:47:38,308][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:47:38,310][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:47:38,311][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:47:39,351][__main__][INFO] - Iteration 284 took 23s (40.40% Gen, 55.22% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 1m 58s. Estimated total time: 19h 45m 48s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 31s, 500 more iterations: 3h 17m 38s.
[2025-11-13 09:47:39,353][__main__][INFO] - Starting iteration 284.
[2025-11-13 09:47:39,356][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 28 and human policies 1.
[2025-11-13 09:47:39,357][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:47:49,243][__main__][INFO] - Number of regex retries in iteration 284: 0
[2025-11-13 09:47:49,243][__main__][INFO] - agents played in iteration 284 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:47:49,709][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:47:49,742][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:47:49,775][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:47:49,809][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:47:49,810][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:47:49,810][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:47:50,541][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:47:50,834][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:47:51,159][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:47:51,484][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:47:51,809][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:47:52,133][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:47:52,462][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:47:52,786][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:47:53,110][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:47:53,435][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:47:53,759][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:47:54,088][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:47:54,419][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:47:54,742][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:47:55,066][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:47:55,390][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:47:55,715][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:47:56,042][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:47:56,366][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:47:56,694][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:47:57,020][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:47:57,348][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:47:57,673][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:47:58,003][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:47:58,328][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:47:58,652][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:47:58,979][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:47:59,303][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:47:59,626][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:47:59,951][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:48:00,275][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:48:00,601][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:48:00,925][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:48:01,632][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:48:02,352][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:48:02,353][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:48:02,355][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:48:03,262][__main__][INFO] - Iteration 285 took 23s (41.35% Gen, 54.84% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 11m 8s. Estimated total time: 19h 55m 22s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 50s, 500 more iterations: 3h 19m 13s.
[2025-11-13 09:48:03,264][__main__][INFO] - Starting iteration 285.
[2025-11-13 09:48:03,267][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 28 and human policies 1.
[2025-11-13 09:48:03,267][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:48:12,301][__main__][INFO] - Number of regex retries in iteration 285: 0
[2025-11-13 09:48:12,302][__main__][INFO] - agents played in iteration 285 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:48:12,759][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:48:12,792][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:48:12,825][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:48:12,859][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:48:12,860][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:48:12,861][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:48:13,594][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:48:13,890][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:48:14,215][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:48:14,537][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:48:14,861][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:48:15,186][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:48:15,509][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:48:15,833][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:48:16,161][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:48:16,486][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:48:16,813][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:48:17,137][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:48:17,466][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:48:17,791][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:48:18,116][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:48:18,442][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:48:18,767][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:48:19,094][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:48:19,421][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:48:19,746][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:48:20,071][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:48:20,395][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:48:20,720][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:48:21,050][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:48:21,379][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:48:21,706][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:48:22,037][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:48:22,364][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:48:22,690][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:48:23,020][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:48:23,345][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:48:23,669][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:48:23,996][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:48:24,703][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:48:25,429][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:48:25,430][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:48:25,432][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:48:26,442][__main__][INFO] - Iteration 286 took 23s (38.98% Gen, 56.65% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 34m 10s. Estimated total time: 19h 18m 48s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 37s, 500 more iterations: 3h 13m 8s.
[2025-11-13 09:48:26,445][__main__][INFO] - Starting iteration 286.
[2025-11-13 09:48:26,448][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 28 and human policies 1.
[2025-11-13 09:48:26,449][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:48:35,615][__main__][INFO] - Number of regex retries in iteration 286: 0
[2025-11-13 09:48:35,616][__main__][INFO] - agents played in iteration 286 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:48:36,085][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:48:36,118][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:48:36,151][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:48:36,184][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:48:36,185][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:48:36,185][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:48:36,925][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:48:37,221][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:48:37,548][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:48:37,872][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:48:38,201][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:48:38,525][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:48:38,852][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:48:39,178][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:48:39,506][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:48:39,833][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:48:40,158][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:48:40,482][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:48:40,809][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:48:41,134][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:48:41,460][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:48:41,787][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:48:42,115][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:48:42,441][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:48:42,767][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:48:43,094][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:48:43,419][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:48:43,750][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:48:44,081][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:48:44,411][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:48:44,736][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:48:45,063][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:48:45,386][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:48:45,710][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:48:46,035][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:48:46,361][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:48:46,686][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:48:47,010][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:48:47,337][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:48:48,056][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:48:48,779][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:48:48,780][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:48:48,782][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:48:49,793][__main__][INFO] - Iteration 287 took 23s (39.26% Gen, 56.39% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 42m 17s. Estimated total time: 19h 27m 18s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 54s, 500 more iterations: 3h 14m 33s.
[2025-11-13 09:48:49,795][__main__][INFO] - Starting iteration 287.
[2025-11-13 09:48:49,799][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 28 and human policies 1.
[2025-11-13 09:48:49,800][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:48:56,140][mllm.models.large_language_model_local][WARNING] - Response %A> did not match regex: (|), retry 1/1
[2025-11-13 09:48:59,199][__main__][INFO] - Number of regex retries in iteration 287: 1
[2025-11-13 09:48:59,200][__main__][INFO] - agents played in iteration 287 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:48:59,686][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:48:59,720][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:48:59,754][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:48:59,787][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:48:59,788][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:48:59,788][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:49:00,519][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:49:00,815][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:49:01,139][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:49:01,463][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:49:01,788][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:49:02,112][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:49:02,435][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:49:02,765][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:49:03,095][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:49:03,420][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:49:03,745][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:49:04,069][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:49:04,401][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:49:04,727][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:49:05,052][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:49:05,376][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:49:05,702][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:49:06,028][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:49:06,357][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:49:06,683][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:49:07,011][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:49:07,342][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:49:07,671][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:49:08,001][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:49:08,327][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:49:08,657][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:49:08,981][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:49:09,304][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:49:09,633][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:49:09,960][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:49:10,288][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:49:10,611][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:49:10,936][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:49:11,605][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:49:12,333][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:49:12,335][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:49:12,337][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:49:13,458][__main__][INFO] - Iteration 288 took 23s (39.73% Gen, 55.53% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 57m 35s. Estimated total time: 19h 42m 59s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 25s, 500 more iterations: 3h 17m 9s.
[2025-11-13 09:49:13,460][__main__][INFO] - Starting iteration 288.
[2025-11-13 09:49:13,463][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 28 and human policies 1.
[2025-11-13 09:49:13,463][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:49:23,588][__main__][INFO] - Number of regex retries in iteration 288: 0
[2025-11-13 09:49:23,588][__main__][INFO] - agents played in iteration 288 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:49:24,086][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:49:24,120][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:49:24,153][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:49:24,187][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:49:24,187][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:49:24,188][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:49:24,908][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:49:25,206][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:49:25,529][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:49:25,854][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:49:26,179][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:49:26,507][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:49:26,834][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:49:27,163][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:49:27,493][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:49:27,818][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:49:28,144][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:49:28,471][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:49:28,795][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:49:29,123][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:49:29,450][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:49:29,779][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:49:30,106][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:49:30,431][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:49:30,758][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:49:31,084][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:49:31,412][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:49:31,738][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:49:32,065][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:49:32,393][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:49:32,721][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:49:33,051][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:49:33,378][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:49:33,702][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:49:34,033][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:49:34,359][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:49:34,688][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:49:35,014][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:49:35,345][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:49:36,206][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:49:36,927][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:49:36,928][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:49:36,930][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:49:37,891][__main__][INFO] - Iteration 289 took 24s (41.45% Gen, 54.61% Train). Generation: 10s, Training: 13s. Estimated remaining time: 18h 35m 38s. Estimated total time: 20h 21m 26s. Time estimates for 10 more iterations: 4m 4s, 100 more iterations: 40m 42s, 500 more iterations: 3h 23m 34s.
[2025-11-13 09:49:37,893][__main__][INFO] - Starting iteration 289.
[2025-11-13 09:49:37,896][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 28 and human policies 1.
[2025-11-13 09:49:37,896][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:49:47,514][__main__][INFO] - Number of regex retries in iteration 289: 0
[2025-11-13 09:49:47,515][__main__][INFO] - agents played in iteration 289 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:49:47,983][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:49:48,017][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:49:48,050][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:49:48,084][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:49:48,085][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:49:48,086][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:49:48,813][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:49:49,108][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:49:49,432][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:49:49,756][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:49:50,080][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:49:50,404][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:49:50,729][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:49:51,054][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:49:51,382][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:49:51,707][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:49:52,033][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:49:52,357][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:49:52,681][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:49:53,008][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:49:53,337][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:49:53,662][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:49:53,987][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:49:54,311][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:49:54,635][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:49:54,959][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:49:55,284][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:49:55,609][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:49:55,933][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:49:56,256][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:49:56,580][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:49:56,905][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:49:57,230][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:49:57,558][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:49:57,884][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:49:58,211][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:49:58,537][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:49:58,863][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:49:59,188][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:49:59,893][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:50:00,616][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:50:00,617][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:50:00,624][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:50:01,816][__main__][INFO] - Iteration 290 took 23s (40.21% Gen, 54.80% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 9m 51s. Estimated total time: 19h 56m 3s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 52s, 500 more iterations: 3h 19m 20s.
[2025-11-13 09:50:01,818][__main__][INFO] - Starting iteration 290.
[2025-11-13 09:50:01,821][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 28 and human policies 1.
[2025-11-13 09:50:01,822][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:50:11,381][__main__][INFO] - Number of regex retries in iteration 290: 0
[2025-11-13 09:50:11,382][__main__][INFO] - agents played in iteration 290 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:50:11,849][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:50:11,882][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:50:11,916][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:50:11,949][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:50:11,950][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:50:11,950][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:50:12,655][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:50:12,949][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:50:13,274][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:50:13,600][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:50:13,927][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:50:14,255][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:50:14,581][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:50:14,907][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:50:15,235][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:50:15,563][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:50:15,889][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:50:16,216][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:50:16,542][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:50:16,871][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:50:17,197][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:50:17,521][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:50:17,844][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:50:18,169][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:50:18,491][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:50:18,815][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:50:19,138][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:50:19,463][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:50:19,787][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:50:20,110][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:50:20,433][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:50:20,756][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:50:21,081][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:50:21,407][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:50:21,732][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:50:22,056][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:50:22,378][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:50:22,702][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:50:23,029][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:50:23,735][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:50:24,446][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:50:24,448][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:50:24,449][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:50:26,235][__main__][INFO] - Iteration 291 took 24s (39.15% Gen, 53.52% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 34m 7s. Estimated total time: 20h 20m 44s. Time estimates for 10 more iterations: 4m 4s, 100 more iterations: 40m 41s, 500 more iterations: 3h 23m 27s.
[2025-11-13 09:50:26,238][__main__][INFO] - Starting iteration 291.
[2025-11-13 09:50:26,241][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 29 and human policies 1.
[2025-11-13 09:50:26,241][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:50:36,058][__main__][INFO] - Number of regex retries in iteration 291: 0 [2025-11-13 09:50:36,059][__main__][INFO] - agents played in iteration 291 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 09:50:36,519][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:50:36,553][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:50:36,585][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:50:36,618][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:50:36,619][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:50:36,619][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:50:37,318][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:50:37,612][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:50:37,936][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:50:38,260][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:50:38,584][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:50:38,908][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:50:39,232][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:50:39,554][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:50:39,879][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:50:40,209][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:50:40,533][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:50:40,860][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:50:41,185][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:50:41,509][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:50:41,836][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:50:42,158][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:50:42,482][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:50:42,807][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:50:43,130][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:50:43,458][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:50:43,781][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:50:44,105][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:50:44,429][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:50:44,752][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:50:45,074][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:50:45,397][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:50:45,722][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:50:46,044][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:50:46,369][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:50:46,693][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:50:47,016][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:50:47,340][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:50:47,666][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:50:48,367][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:50:49,089][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:50:49,091][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:50:49,092][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:50:50,120][__main__][INFO] - Iteration 292 took 23s (41.11% Gen, 54.58% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 7m 0s. Estimated total time: 19h 54m 1s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 48s, 500 more iterations: 3h 19m 0s.
[2025-11-13 09:50:50,122][__main__][INFO] - Starting iteration 292.
[2025-11-13 09:50:50,126][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 29 and human policies 1.
[2025-11-13 09:50:50,126][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:50:59,498][__main__][INFO] - Number of regex retries in iteration 292: 0
[2025-11-13 09:50:59,499][__main__][INFO] - agents played in iteration 292 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:50:59,971][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:51:00,004][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:51:00,037][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:51:00,070][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:51:00,070][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:51:00,071][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:51:00,805][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:51:01,100][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:51:01,426][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:51:01,748][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:51:02,077][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:51:02,401][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:51:02,726][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:51:03,054][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:51:03,378][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:51:03,702][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:51:04,026][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:51:04,351][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:51:04,675][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:51:05,000][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:51:05,325][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:51:05,650][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:51:05,973][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:51:06,297][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:51:06,621][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:51:06,944][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:51:07,268][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:51:07,592][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:51:07,916][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:51:08,241][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:51:08,566][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:51:08,888][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:51:09,211][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:51:09,535][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:51:09,858][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:51:10,183][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:51:10,506][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:51:10,831][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:51:11,155][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:51:11,876][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:51:12,603][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:51:12,604][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:51:12,606][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:51:13,638][__main__][INFO] - Iteration 293 took 23s (39.86% Gen, 55.75% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 48m 14s. Estimated total time: 19h 35m 39s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 11s, 500 more iterations: 3h 15m 56s.
[2025-11-13 09:51:13,640][__main__][INFO] - Starting iteration 293.
[2025-11-13 09:51:13,643][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 29 and human policies 1.
[2025-11-13 09:51:13,644][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:51:22,667][__main__][INFO] - Number of regex retries in iteration 293: 0
[2025-11-13 09:51:22,667][__main__][INFO] - agents played in iteration 293 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:51:23,133][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:51:23,166][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:51:23,199][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:51:23,233][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:51:23,233][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:51:23,234][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:51:23,971][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:51:24,266][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:51:24,594][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:51:24,924][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:51:25,250][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:51:25,573][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:51:25,896][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:51:26,221][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:51:26,545][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:51:26,868][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:51:27,193][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:51:27,515][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:51:27,839][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:51:28,164][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:51:28,489][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:51:28,813][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:51:29,139][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:51:29,465][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:51:29,790][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:51:30,115][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:51:30,445][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:51:30,767][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:51:31,101][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:51:31,431][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:51:31,759][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:51:32,081][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:51:32,407][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:51:32,730][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:51:33,056][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:51:33,383][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:51:33,709][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:51:34,035][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:51:34,359][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:51:35,036][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:51:35,754][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:51:35,755][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:51:35,757][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:51:36,732][__main__][INFO] - Iteration 294 took 23s (39.08% Gen, 56.69% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 26m 41s. Estimated total time: 19h 14m 28s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 28s, 500 more iterations: 3h 12m 24s.
[2025-11-13 09:51:36,734][__main__][INFO] - Starting iteration 294.
[2025-11-13 09:51:36,737][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 29 and human policies 1.
[2025-11-13 09:51:36,738][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:51:46,379][__main__][INFO] - Number of regex retries in iteration 294: 0
[2025-11-13 09:51:46,379][__main__][INFO] - agents played in iteration 294 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:51:46,842][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:51:46,877][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:51:46,910][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:51:46,944][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:51:46,945][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:51:46,945][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:51:47,682][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:51:47,976][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:51:48,301][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:51:48,627][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:51:48,952][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:51:49,277][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:51:49,601][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:51:49,925][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:51:50,250][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:51:50,574][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:51:50,899][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:51:51,221][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:51:51,546][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:51:51,872][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:51:52,197][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:51:52,523][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:51:52,848][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:51:53,172][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:51:53,497][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:51:53,826][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:51:54,150][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:51:54,475][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:51:54,804][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:51:55,133][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:51:55,457][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:51:55,780][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:51:56,105][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:51:56,430][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:51:56,756][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:51:57,084][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:51:57,408][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:51:57,732][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:51:58,055][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:51:58,732][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:51:59,458][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:51:59,459][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:51:59,461][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:52:00,488][__main__][INFO] - Iteration 295 took 23s (40.60% Gen, 55.08% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 59m 23s. Estimated total time: 19h 47m 34s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 35s, 500 more iterations: 3h 17m 55s.
[2025-11-13 09:52:00,490][__main__][INFO] - Starting iteration 295.
[2025-11-13 09:52:00,493][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 29 and human policies 1.
[2025-11-13 09:52:00,493][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:52:10,130][__main__][INFO] - Number of regex retries in iteration 295: 0
[2025-11-13 09:52:10,131][__main__][INFO] - agents played in iteration 295 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:52:10,598][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:52:10,631][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:52:10,664][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:52:10,699][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:52:10,699][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:52:10,700][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:52:11,430][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:52:11,724][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:52:12,047][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:52:12,373][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:52:12,698][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:52:13,022][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:52:13,345][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:52:13,670][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:52:13,994][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:52:14,317][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:52:14,641][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:52:14,966][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:52:15,292][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:52:15,616][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:52:15,940][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:52:16,266][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:52:16,591][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:52:16,917][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:52:17,240][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:52:17,562][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:52:17,892][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:52:18,216][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:52:18,541][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:52:18,866][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:52:19,189][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:52:19,514][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:52:19,838][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:52:20,161][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:52:20,485][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:52:20,810][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:52:21,139][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:52:21,469][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:52:21,794][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:52:22,477][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:52:23,198][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:52:23,199][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:52:23,201][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:52:24,203][__main__][INFO] - Iteration 296 took 23s (40.64% Gen, 55.12% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 56m 57s. Estimated total time: 19h 45m 32s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 31s, 500 more iterations: 3h 17m 35s.
[2025-11-13 09:52:24,206][__main__][INFO] - Starting iteration 296.
[2025-11-13 09:52:24,208][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 29 and human policies 1.
[2025-11-13 09:52:24,209][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:52:33,979][__main__][INFO] - Number of regex retries in iteration 296: 0
[2025-11-13 09:52:33,979][__main__][INFO] - agents played in iteration 296 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:52:34,460][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:52:34,493][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:52:34,527][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:52:34,560][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:52:34,562][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:52:34,562][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:52:35,295][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:52:35,590][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:52:35,914][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:52:36,237][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:52:36,561][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:52:36,885][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:52:37,208][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:52:37,530][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:52:37,853][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:52:38,176][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:52:38,500][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:52:38,826][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:52:39,151][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:52:39,474][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:52:39,803][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:52:40,127][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:52:40,452][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:52:40,780][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:52:41,104][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:52:41,429][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:52:41,755][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:52:42,078][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:52:42,405][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:52:42,731][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:52:43,057][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:52:43,380][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:52:43,705][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:52:44,032][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:52:44,355][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:52:44,684][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:52:45,008][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:52:45,331][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:52:45,657][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
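The run of "Processing mini-batch N of 128" records above, followed by a single "Accumulated the policy gradient loss for 3840 tokens." line, is the signature of gradient accumulation: the loss is backpropagated per mini-batch but the optimizer only steps once at the end. A minimal sketch of the logging pattern, with hypothetical names and a token-weighted running loss standing in for the real backward passes:

```python
def accumulate_policy_gradient(minibatches, log=print, log_every=4):
    """Accumulate a policy-gradient loss over mini-batches, logging
    progress every `log_every` batches as the trainer does.

    `minibatches` is a list of (loss_value, num_tokens) pairs; in the
    real trainer each step would call loss.backward() without an
    optimizer step, so gradients accumulate across mini-batches.
    """
    total_tokens = 0
    total_loss = 0.0
    n = len(minibatches)
    for i, (loss, num_tokens) in enumerate(minibatches):
        if i % log_every == 0:
            log(f"Processing mini-batch {i} of {n}")
        total_loss += loss * num_tokens
        total_tokens += num_tokens
    log(f"Accumulated the policy gradient loss for {total_tokens} tokens.")
    return total_loss / max(total_tokens, 1)
```

With 128 mini-batches of 30 tokens each, this reproduces the cadence seen above: 32 progress records (every 4th batch) and a final accumulation record for 3840 tokens.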
[2025-11-13 09:52:46,366][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:52:47,090][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:52:47,092][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:52:47,093][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:52:48,124][__main__][INFO] - Iteration 297 took 23s (40.85% Gen, 54.83% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 6m 50s. Estimated total time: 19h 55m 49s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 51s, 500 more iterations: 3h 19m 18s.
[2025-11-13 09:52:48,126][__main__][INFO] - Starting iteration 297.
[2025-11-13 09:52:48,129][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 29 and human policies 1.
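The per-iteration summary records ("Estimated remaining time", "Time estimates for 10/100/500 more iterations") are straightforward projections from the average iteration duration so far. A minimal sketch under that assumption, with hypothetical function names and a duration formatter that drops leading zero fields the way the log does:

```python
def fmt_hms(seconds):
    """Format a duration as 'Xh Ym Zs', omitting leading zero fields
    (e.g. 239 s -> '3m 59s', 23 s -> '23s')."""
    s = int(seconds)
    h, s = divmod(s, 3600)
    m, s = divmod(s, 60)
    if h:
        return f"{h}h {m}m {s}s"
    if m:
        return f"{m}m {s}s"
    return f"{s}s"

def eta(iter_seconds, done, total):
    """Project remaining wall-clock time from the average iteration
    duration observed so far.

    iter_seconds - list of per-iteration durations measured so far
    done / total - iterations completed / planned
    """
    avg = sum(iter_seconds) / len(iter_seconds)
    return fmt_hms(avg * (total - done))
```

For example, at ~24 s per iteration with 2700 iterations left, the projection is 18 hours, which is the order of magnitude the log reports.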
[2025-11-13 09:52:48,130][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:52:57,194][__main__][INFO] - Number of regex retries in iteration 297: 0
[2025-11-13 09:52:57,195][__main__][INFO] - agents played in iteration 297 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:52:57,677][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:52:58,050][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:52:58,083][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:52:58,117][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:52:58,118][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:52:58,118][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:52:58,850][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:52:59,147][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:52:59,474][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:52:59,800][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:53:00,124][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:53:00,448][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:53:00,771][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:53:01,096][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:53:01,423][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:53:01,747][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:53:02,073][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:53:02,402][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:53:02,730][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:53:03,054][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:53:03,377][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:53:03,702][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:53:04,029][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:53:04,352][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:53:04,680][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:53:05,006][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:53:05,330][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:53:05,652][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:53:05,977][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:53:06,300][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:53:06,625][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:53:06,947][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:53:07,270][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:53:07,592][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:53:07,917][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:53:08,246][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:53:08,574][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:53:08,896][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:53:09,225][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:53:09,932][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:53:10,655][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:53:10,657][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:53:10,659][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:53:11,628][__main__][INFO] - Iteration 298 took 23s (38.57% Gen, 57.29% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 45m 36s. Estimated total time: 19h 34m 58s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 9s, 500 more iterations: 3h 15m 49s.
[2025-11-13 09:53:11,630][__main__][INFO] - Starting iteration 298.
[2025-11-13 09:53:11,633][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 29 and human policies 1.
[2025-11-13 09:53:11,634][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:53:20,828][__main__][INFO] - Number of regex retries in iteration 298: 0
[2025-11-13 09:53:20,828][__main__][INFO] - agents played in iteration 298 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:53:21,296][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:53:21,329][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:53:21,362][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:53:21,396][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:53:21,397][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:53:21,397][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:53:22,143][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:53:22,439][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:53:22,768][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:53:23,091][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:53:23,414][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:53:23,737][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:53:24,061][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:53:24,384][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:53:24,707][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:53:25,030][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:53:25,357][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:53:25,681][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:53:26,007][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:53:26,331][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:53:26,657][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:53:26,981][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:53:27,308][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:53:27,633][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:53:27,957][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:53:28,281][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:53:28,606][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:53:28,930][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:53:29,254][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:53:29,576][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:53:29,901][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:53:30,230][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:53:30,556][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:53:30,886][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:53:31,208][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:53:31,530][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:53:31,858][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:53:32,185][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:53:32,514][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:53:33,201][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:53:33,924][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:53:33,926][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:53:33,927][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:53:34,900][__main__][INFO] - Iteration 299 took 23s (39.52% Gen, 56.30% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 33m 37s. Estimated total time: 19h 23m 22s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 46s, 500 more iterations: 3h 13m 53s.
[2025-11-13 09:53:34,902][__main__][INFO] - Starting iteration 299.
[2025-11-13 09:53:34,905][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 29 and human policies 1.
[2025-11-13 09:53:34,906][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:53:42,318][mllm.models.large_language_model_local][WARNING] - Response did not match regex: (|), retry 1/1
[2025-11-13 09:53:44,881][__main__][INFO] - Number of regex retries in iteration 299: 1
[2025-11-13 09:53:44,882][__main__][INFO] - agents played in iteration 299 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:53:45,369][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:53:45,402][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:53:45,434][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:53:45,468][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:53:45,468][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:53:45,469][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:53:46,205][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:53:46,501][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:53:46,826][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:53:47,149][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:53:47,474][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:53:47,799][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:53:48,123][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:53:48,448][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:53:48,771][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:53:49,097][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:53:49,423][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:53:49,747][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:53:50,072][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:53:50,397][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:53:50,721][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:53:51,046][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:53:51,373][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:53:51,698][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:53:52,023][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:53:52,346][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:53:52,669][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:53:52,992][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:53:53,315][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:53:53,639][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:53:53,962][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:53:54,286][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:53:54,610][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:53:54,934][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:53:55,258][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:53:55,582][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:53:55,906][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:53:56,231][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:53:56,555][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:53:57,241][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:53:57,958][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:53:57,959][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:53:57,961][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:53:58,993][__main__][INFO] - Iteration 300 took 24s (41.42% Gen, 54.30% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 14m 15s. Estimated total time: 20h 4m 25s. Time estimates for 10 more iterations: 4m 0s, 100 more iterations: 40m 8s, 500 more iterations: 3h 20m 44s.
[2025-11-13 09:53:58,995][__main__][INFO] - Starting iteration 300.
[2025-11-13 09:53:58,999][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 29 and human policies 1.
[2025-11-13 09:53:58,999][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:54:08,518][__main__][INFO] - Number of regex retries in iteration 300: 0
[2025-11-13 09:54:08,519][__main__][INFO] - agents played in iteration 300 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:54:08,987][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:54:09,020][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:54:09,053][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:54:09,087][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:54:09,087][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:54:09,088][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:54:09,831][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:54:10,127][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:54:10,453][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:54:10,782][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:54:11,107][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:54:11,432][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:54:11,758][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:54:12,083][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:54:12,412][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:54:12,737][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:54:13,060][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:54:13,384][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:54:13,710][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:54:14,033][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:54:14,358][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:54:14,682][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:54:15,006][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:54:15,331][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:54:15,658][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:54:15,982][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:54:16,306][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:54:16,629][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:54:16,952][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:54:17,277][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:54:17,602][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:54:17,926][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:54:18,250][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:54:18,574][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:54:18,897][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:54:19,222][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:54:19,546][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:54:19,871][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:54:20,195][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:54:20,887][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:54:21,629][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:54:21,631][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:54:21,632][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:54:23,703][__main__][INFO] - Iteration 301 took 24s (38.53% Gen, 53.08% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 44m 40s. Estimated total time: 20h 35m 15s. Time estimates for 10 more iterations: 4m 7s, 100 more iterations: 41m 10s, 500 more iterations: 3h 25m 52s.
[2025-11-13 09:54:23,705][__main__][INFO] - Starting iteration 301.
[2025-11-13 09:54:23,708][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 30 and human policies 1.
[2025-11-13 09:54:23,709][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:54:33,411][__main__][INFO] - Number of regex retries in iteration 301: 0
[2025-11-13 09:54:33,412][__main__][INFO] - agents played in iteration 301 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:54:33,881][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:54:33,915][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:54:33,949][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:54:33,982][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:54:33,983][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:54:33,983][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:54:34,712][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:54:35,007][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:54:35,331][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:54:35,654][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:54:35,978][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:54:36,305][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:54:36,631][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:54:36,955][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:54:37,279][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:54:37,605][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:54:37,934][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:54:38,261][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:54:38,587][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:54:38,910][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:54:39,238][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:54:39,564][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:54:39,888][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:54:40,211][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:54:40,534][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:54:40,857][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:54:41,180][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:54:41,506][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:54:41,832][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:54:42,160][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:54:42,484][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:54:42,810][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:54:43,133][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:54:43,458][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:54:43,785][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:54:44,112][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:54:44,438][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:54:44,766][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:54:45,090][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:54:45,782][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:54:46,513][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:54:46,514][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:54:46,516][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:54:47,527][__main__][INFO] - Iteration 302 took 23s (40.73% Gen, 55.01% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 59m 58s. Estimated total time: 19h 50m 57s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 41s, 500 more iterations: 3h 18m 29s.
[2025-11-13 09:54:47,529][__main__][INFO] - Starting iteration 302.
[2025-11-13 09:54:47,532][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 30 and human policies 1.
[2025-11-13 09:54:47,533][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:54:57,089][__main__][INFO] - Number of regex retries in iteration 302: 0
[2025-11-13 09:54:57,090][__main__][INFO] - agents played in iteration 302 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:54:57,572][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:54:57,605][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:54:57,639][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:54:57,672][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:54:57,673][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:54:57,673][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:54:58,400][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:54:58,697][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:54:59,022][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:54:59,346][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:54:59,671][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:54:59,996][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:55:00,323][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:55:00,651][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:55:00,977][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:55:01,303][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:55:01,627][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:55:01,953][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:55:02,280][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:55:02,610][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:55:02,935][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:55:03,260][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:55:03,585][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:55:03,909][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:55:04,234][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:55:04,560][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:55:04,884][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:55:05,213][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:55:05,538][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:55:05,863][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:55:06,187][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:55:06,514][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:55:06,837][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:55:07,161][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:55:07,485][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:55:07,811][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:55:08,137][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:55:08,462][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:55:08,788][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:55:09,471][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:55:10,200][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:55:10,202][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:55:10,203][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:55:11,212][__main__][INFO] - Iteration 303 took 23s (40.36% Gen, 55.37% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 52m 39s. Estimated total time: 19h 44m 1s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 28s, 500 more iterations: 3h 17m 20s.
[2025-11-13 09:55:11,214][__main__][INFO] - Starting iteration 303.
[2025-11-13 09:55:11,217][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 30 and human policies 1.
[2025-11-13 09:55:11,218][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:55:19,579][__main__][INFO] - Number of regex retries in iteration 303: 0
[2025-11-13 09:55:19,579][__main__][INFO] - agents played in iteration 303 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:55:20,061][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:55:20,095][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:55:20,128][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:55:20,162][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:55:20,162][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:55:20,163][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:55:20,889][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:55:21,183][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:55:21,510][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:55:21,835][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:55:22,160][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:55:22,485][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:55:22,812][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:55:23,138][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:55:23,462][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:55:23,784][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:55:24,108][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:55:24,431][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:55:24,756][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:55:25,080][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:55:25,407][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:55:25,737][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:55:26,068][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:55:26,396][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:55:26,719][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:55:27,045][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:55:27,372][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:55:27,697][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:55:28,027][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:55:28,354][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:55:28,678][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:55:29,003][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:55:29,326][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:55:29,649][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:55:29,972][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:55:30,297][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:55:30,620][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:55:30,942][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:55:31,266][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:55:31,938][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:55:32,720][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:55:32,721][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:55:32,723][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:55:33,679][__main__][INFO] - Iteration 304 took 22s (37.22% Gen, 58.51% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 51m 23s. Estimated total time: 18h 43m 8s. Time estimates for 10 more iterations: 3m 44s, 100 more iterations: 37m 26s, 500 more iterations: 3h 7m 11s.
[2025-11-13 09:55:33,681][__main__][INFO] - Starting iteration 304.
[2025-11-13 09:55:33,684][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 30 and human policies 1.
[2025-11-13 09:55:33,685][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:55:42,184][__main__][INFO] - Number of regex retries in iteration 304: 0
[2025-11-13 09:55:42,184][__main__][INFO] - agents played in iteration 304 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:55:42,650][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:55:42,684][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:55:42,717][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:55:42,751][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:55:42,751][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:55:42,752][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:55:43,487][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:55:43,782][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:55:44,108][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:55:44,432][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:55:44,758][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:55:45,082][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:55:45,407][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:55:45,732][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:55:46,058][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:55:46,383][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:55:46,707][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:55:47,030][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:55:47,354][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:55:47,685][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:55:48,013][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:55:48,341][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:55:48,667][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:55:48,993][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:55:49,319][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:55:49,644][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:55:49,969][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:55:50,292][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:55:50,618][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:55:50,942][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:55:51,264][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:55:51,588][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:55:51,912][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:55:52,234][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:55:52,557][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:55:52,882][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:55:53,207][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:55:53,530][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:55:53,854][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:55:54,525][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:55:55,251][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:55:55,253][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:55:55,254][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:55:56,215][__main__][INFO] - Iteration 305 took 22s (37.72% Gen, 58.01% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 54m 27s. Estimated total time: 18h 46m 34s. Time estimates for 10 more iterations: 3m 45s, 100 more iterations: 37m 33s, 500 more iterations: 3h 7m 45s.
[2025-11-13 09:55:56,217][__main__][INFO] - Starting iteration 305.
[2025-11-13 09:55:56,220][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 30 and human policies 1.
[2025-11-13 09:55:56,221][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:56:05,202][__main__][INFO] - Number of regex retries in iteration 305: 0
[2025-11-13 09:56:05,202][__main__][INFO] - agents played in iteration 305 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:56:05,680][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:56:05,713][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:56:05,746][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:56:05,780][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:56:05,780][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:56:05,781][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:56:06,525][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:56:06,821][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:56:07,147][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:56:07,471][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:56:07,796][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:56:08,121][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:56:08,446][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:56:08,770][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:56:09,094][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:56:09,418][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:56:09,746][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:56:10,070][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:56:10,397][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:56:10,724][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:56:11,048][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:56:11,371][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:56:11,695][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:56:12,020][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:56:12,346][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:56:12,673][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:56:12,999][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:56:13,325][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:56:13,651][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:56:13,976][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:56:14,301][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:56:14,627][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:56:14,953][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:56:15,278][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:56:15,604][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:56:15,932][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:56:16,256][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:56:16,581][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:56:16,904][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:56:17,585][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:56:18,311][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:56:18,314][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:56:18,316][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:56:19,276][__main__][INFO] - Iteration 306 took 23s (38.95% Gen, 56.87% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 20m 19s. Estimated total time: 19h 12m 49s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 25s, 500 more iterations: 3h 12m 8s.
[2025-11-13 09:56:19,278][__main__][INFO] - Starting iteration 306.
[2025-11-13 09:56:19,281][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 30 and human policies 1.
[2025-11-13 09:56:19,282][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:56:28,113][__main__][INFO] - Number of regex retries in iteration 306: 0
[2025-11-13 09:56:28,114][__main__][INFO] - agents played in iteration 306 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:56:28,595][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:56:28,627][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:56:28,660][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:56:28,694][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:56:28,695][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:56:28,695][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:56:29,430][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:56:29,725][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:56:30,049][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:56:30,373][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:56:30,697][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:56:31,023][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:56:31,347][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:56:31,671][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:56:31,995][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:56:32,320][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:56:32,645][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:56:32,970][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:56:33,297][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:56:33,626][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:56:33,954][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:56:34,281][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:56:34,606][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:56:34,931][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:56:35,255][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:56:35,579][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:56:35,908][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:56:36,238][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:56:36,565][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:56:36,892][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:56:37,217][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:56:37,542][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:56:37,870][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:56:38,197][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:56:38,522][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:56:38,851][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:56:39,177][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:56:39,507][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:56:39,834][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:56:40,517][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:56:41,233][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:56:41,235][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:56:41,237][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:56:42,332][__main__][INFO] - Iteration 307 took 23s (38.31% Gen, 56.93% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 19m 41s. Estimated total time: 19h 12m 35s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 25s, 500 more iterations: 3h 12m 5s.
[2025-11-13 09:56:42,334][__main__][INFO] - Starting iteration 307.
[2025-11-13 09:56:42,338][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 30 and human policies 1.
[2025-11-13 09:56:42,338][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:56:51,508][__main__][INFO] - Number of regex retries in iteration 307: 0
[2025-11-13 09:56:51,508][__main__][INFO] - agents played in iteration 307 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:56:52,012][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:56:52,045][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:56:52,079][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:56:52,113][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:56:52,113][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:56:52,114][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:56:52,846][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:56:53,141][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:56:53,466][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:56:53,789][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:56:54,113][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:56:54,438][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:56:54,762][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:56:55,086][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:56:55,414][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:56:55,738][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:56:56,064][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:56:56,388][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:56:56,714][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:56:57,038][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:56:57,364][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:56:57,691][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:56:58,014][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:56:58,339][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:56:58,662][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:56:58,987][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:56:59,312][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:56:59,636][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:56:59,960][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:57:00,287][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:57:00,612][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:57:00,939][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:57:01,263][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:57:01,590][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:57:01,918][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:57:02,244][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:57:02,570][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:57:02,893][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:57:03,216][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:57:03,886][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:57:04,604][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:57:04,605][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:57:04,607][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:57:05,562][__main__][INFO] - Iteration 308 took 23s (39.48% Gen, 56.40% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 27m 57s. Estimated total time: 19h 21m 13s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 42s, 500 more iterations: 3h 13m 32s.
[2025-11-13 09:57:05,564][__main__][INFO] - Starting iteration 308.
[2025-11-13 09:57:05,568][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 30 and human policies 1.
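The per-iteration summary records above can be pulled back out of a log like this with a small regex; a minimal sketch (the pattern and the `parse_summary` helper are illustrative, not part of mllm):

```python
import re
from typing import Optional

# Hypothetical pattern for the "Iteration N took Xs (...)" summary records.
SUMMARY_RE = re.compile(
    r"Iteration (?P<it>\d+) took (?P<total>\d+)s "
    r"\((?P<gen_pct>[\d.]+)% Gen, (?P<train_pct>[\d.]+)% Train\)\. "
    r"Generation: (?P<gen>\d+)s, Training: (?P<train>\d+)s\."
)

def parse_summary(line: str) -> Optional[dict]:
    """Extract the timing fields of an iteration-summary record, or None."""
    m = SUMMARY_RE.search(line)
    if m is None:
        return None
    # Integer counts stay ints; percentages become floats.
    return {k: float(v) if "." in v else int(v) for k, v in m.groupdict().items()}

line = ("Iteration 308 took 23s (39.48% Gen, 56.40% Train). "
        "Generation: 9s, Training: 13s.")
print(parse_summary(line))
# → {'it': 308, 'total': 23, 'gen_pct': 39.48, 'train_pct': 56.4, 'gen': 9, 'train': 13}
```

Applied line by line over the whole log, this gives a per-iteration timing series suitable for plotting generation/training splits.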
[2025-11-13 09:57:05,568][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:57:14,426][__main__][INFO] - Number of regex retries in iteration 308: 0
[2025-11-13 09:57:14,426][__main__][INFO] - agents played in iteration 308 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:57:14,900][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:57:14,933][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:57:14,967][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:57:15,001][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:57:15,001][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:57:15,001][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:57:15,726][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:57:16,020][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:57:16,344][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:57:16,669][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:57:16,993][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:57:17,318][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:57:17,642][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:57:17,968][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:57:18,293][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:57:18,618][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:57:18,944][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:57:19,269][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:57:19,594][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:57:19,919][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:57:20,245][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:57:20,570][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:57:20,895][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:57:21,220][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:57:21,546][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:57:21,875][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:57:22,200][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:57:22,526][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:57:22,849][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:57:23,174][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:57:23,499][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:57:23,827][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:57:24,155][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:57:24,483][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:57:24,807][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:57:25,132][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:57:25,456][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:57:25,781][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:57:26,106][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:57:26,796][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:57:27,515][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:57:27,517][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:57:27,518][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:57:28,684][__main__][INFO] - Iteration 309 took 23s (38.32% Gen, 56.63% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 22m 12s. Estimated total time: 19h 15m 51s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 31s, 500 more iterations: 3h 12m 38s.
[2025-11-13 09:57:28,686][__main__][INFO] - Starting iteration 309.
[2025-11-13 09:57:28,689][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 30 and human policies 1.
[2025-11-13 09:57:28,690][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:57:37,170][__main__][INFO] - Number of regex retries in iteration 309: 0
[2025-11-13 09:57:37,171][__main__][INFO] - agents played in iteration 309 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:57:37,646][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:57:37,679][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:57:37,713][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:57:37,746][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:57:37,747][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:57:37,747][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:57:38,491][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:57:38,786][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:57:39,112][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:57:39,437][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:57:39,760][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:57:40,085][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:57:40,411][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:57:40,735][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:57:41,059][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:57:41,385][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:57:41,712][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:57:42,036][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:57:42,361][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:57:42,687][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:57:43,014][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:57:43,338][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:57:43,666][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:57:43,990][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:57:44,315][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:57:44,645][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:57:44,974][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:57:45,300][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:57:45,628][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:57:45,956][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:57:46,283][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:57:46,609][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:57:46,934][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:57:47,263][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:57:47,594][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:57:47,918][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:57:48,246][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:57:48,574][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:57:48,903][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:57:49,606][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:57:50,351][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:57:50,353][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:57:50,355][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:57:51,401][__main__][INFO] - Iteration 310 took 22s (37.34% Gen, 58.04% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 1m 36s. Estimated total time: 18h 55m 38s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 51s, 500 more iterations: 3h 9m 16s.
[2025-11-13 09:57:51,403][__main__][INFO] - Starting iteration 310.
[2025-11-13 09:57:51,406][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 30 and human policies 1.
[2025-11-13 09:57:51,407][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:58:00,428][__main__][INFO] - Number of regex retries in iteration 310: 0
[2025-11-13 09:58:00,429][__main__][INFO] - agents played in iteration 310 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:58:00,899][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:58:00,932][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:58:00,966][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:58:01,000][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:58:01,001][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:58:01,002][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:58:01,739][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:58:02,036][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:58:02,361][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:58:02,686][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:58:03,010][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:58:03,335][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:58:03,659][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:58:03,983][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:58:04,308][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:58:04,633][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:58:04,958][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:58:05,283][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:58:05,608][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:58:05,933][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:58:06,259][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:58:06,586][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:58:06,910][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:58:07,232][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:58:07,559][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:58:07,883][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:58:08,208][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:58:08,536][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:58:08,862][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:58:09,188][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:58:09,513][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:58:09,839][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:58:10,165][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:58:10,491][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:58:10,817][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:58:11,142][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:58:11,466][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:58:11,792][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:58:12,117][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:58:12,833][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:58:13,682][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:58:13,683][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:58:13,694][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:58:16,204][__main__][INFO] - Iteration 311 took 24s (36.38% Gen, 53.49% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 45m 30s. Estimated total time: 20h 39m 57s. Time estimates for 10 more iterations: 4m 7s, 100 more iterations: 41m 19s, 500 more iterations: 3h 26m 39s.
[2025-11-13 09:58:16,206][__main__][INFO] - Starting iteration 311.
[2025-11-13 09:58:16,210][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 31 and human policies 1.
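The "Time estimates for 10/100/500 more iterations" figures above look consistent with extrapolating a running mean of recent iteration durations; a sketch under that assumption (the trainer's actual estimator is not shown in the log, and the sample durations here are made up):

```python
def fmt(seconds: float) -> str:
    """Format seconds the way the log does: '3m 52s' or '3h 13m 32s'."""
    s = int(round(seconds))
    h, s = divmod(s, 3600)
    m, s = divmod(s, 60)
    return f"{h}h {m}m {s}s" if h else f"{m}m {s}s"

# Hypothetical recent per-iteration wall-clock times, in seconds.
durations = [23.1, 23.0, 22.7]
mean = sum(durations) / len(durations)
for n in (10, 100, 500):
    print(f"{n} more iterations: {fmt(n * mean)}")
```

The fractional mean explains why, e.g., an iteration reported as "24s" can still yield "10 more iterations: 4m 7s" rather than an even 4m 0s.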
[2025-11-13 09:58:16,210][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:58:25,885][__main__][INFO] - Number of regex retries in iteration 311: 0
[2025-11-13 09:58:25,886][__main__][INFO] - agents played in iteration 311 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:58:26,356][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:58:26,389][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:58:26,423][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:58:26,456][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:58:26,457][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:58:26,458][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:58:27,203][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:58:27,499][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:58:27,825][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:58:28,150][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:58:28,474][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:58:28,800][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:58:29,127][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:58:29,450][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:58:29,776][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:58:30,100][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:58:30,424][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:58:30,747][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:58:31,071][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:58:31,396][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:58:31,721][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:58:32,044][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:58:32,371][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:58:32,695][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:58:33,019][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:58:33,343][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:58:33,668][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:58:33,993][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:58:34,318][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:58:34,644][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:58:34,970][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:58:35,294][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:58:35,619][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:58:35,945][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:58:36,269][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:58:36,594][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:58:36,919][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:58:37,244][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:58:37,568][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:58:38,238][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:58:38,978][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:58:38,980][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:58:38,981][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:58:40,226][__main__][INFO] - Iteration 312 took 24s (40.29% Gen, 54.52% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 6m 1s. Estimated total time: 20h 0m 52s. Time estimates for 10 more iterations: 4m 0s, 100 more iterations: 40m 1s, 500 more iterations: 3h 20m 8s.
[2025-11-13 09:58:40,228][__main__][INFO] - Starting iteration 312.
[2025-11-13 09:58:40,232][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 31 and human policies 1.
[2025-11-13 09:58:40,232][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:58:48,734][__main__][INFO] - Number of regex retries in iteration 312: 0
[2025-11-13 09:58:48,735][__main__][INFO] - agents played in iteration 312 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:58:49,209][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:58:49,242][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:58:49,276][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:58:49,310][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:58:49,311][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:58:49,311][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:58:50,049][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:58:50,344][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:58:50,668][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:58:50,992][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:58:51,315][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:58:51,639][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:58:51,965][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:58:52,288][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:58:52,613][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:58:52,939][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:58:53,264][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:58:53,589][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:58:53,913][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:58:54,238][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:58:54,561][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:58:54,890][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:58:55,216][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:58:55,540][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:58:55,865][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:58:56,190][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:58:56,514][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:58:56,839][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:58:57,163][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:58:57,488][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:58:57,812][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:58:58,143][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:58:58,469][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:58:58,794][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:58:59,120][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:58:59,445][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:58:59,769][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:59:00,094][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:59:00,419][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:59:01,099][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:59:01,825][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:59:01,827][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:59:01,828][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:59:02,813][__main__][INFO] - Iteration 313 took 22s (37.65% Gen, 57.98% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 53m 53s. Estimated total time: 18h 49m 6s. Time estimates for 10 more iterations: 3m 45s, 100 more iterations: 37m 38s, 500 more iterations: 3h 8m 11s.
[2025-11-13 09:59:02,815][__main__][INFO] - Starting iteration 313.
[2025-11-13 09:59:02,818][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 31 and human policies 1.
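The three VRAM figures in each task block read as percentages of total device memory: the change across the block, the current allocation, and the block's peak. A sketch of that arithmetic with made-up byte counts (the device size and the field meanings are assumptions inferred from the log, not taken from mllm's source):

```python
def vram_metrics(before: int, after: int, block_peak: int, total: int):
    """Return (ΔVRAM %, current %, block-peak %) of device VRAM, from bytes."""
    pct = lambda nbytes: 100.0 * nbytes / total
    return pct(after - before), pct(after), pct(block_peak)

total = 80 * 1024**3                  # e.g. an 80 GiB device (assumption)
before = after = int(0.4078 * total)  # allocation unchanged across the block
peak = int(0.1952 * total)
d, cur, pk = vram_metrics(before, after, peak, total)
print(f"ΔVRAM % (total): {d:.2f}%, Current: {cur:.2f}%, Block Peak: {pk:.2f}%")
# → ΔVRAM % (total): 0.00%, Current: 40.78%, Block Peak: 19.52%
```

In a real trainer the byte counts would come from the CUDA allocator's current/peak counters sampled at the start and end of the task block.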
[2025-11-13 09:59:02,819][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:59:11,946][__main__][INFO] - Number of regex retries in iteration 313: 0
[2025-11-13 09:59:11,946][__main__][INFO] - agents played in iteration 313 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:59:12,419][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:59:12,452][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:59:12,486][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:59:12,519][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:59:12,520][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:59:12,521][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:59:13,254][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:59:13,549][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:59:13,874][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:59:14,198][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:59:14,525][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:59:14,849][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:59:15,173][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:59:15,498][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:59:15,822][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:59:16,147][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:59:16,471][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:59:16,795][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:59:17,118][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:59:17,445][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:59:17,769][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:59:18,094][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:59:18,417][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:59:18,741][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:59:19,066][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:59:19,391][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:59:19,716][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:59:20,040][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:59:20,365][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:59:20,689][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:59:21,016][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:59:21,346][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:59:21,677][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:59:22,003][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:59:22,330][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:59:22,656][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:59:22,980][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:59:23,309][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:59:23,634][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:59:24,307][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:59:25,028][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:59:25,029][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:59:25,031][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:59:25,983][__main__][INFO] - Iteration 314 took 23s (39.40% Gen, 56.49% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 22m 38s. Estimated total time: 19h 18m 15s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 36s, 500 more iterations: 3h 13m 2s.
[2025-11-13 09:59:25,985][__main__][INFO] - Starting iteration 314.
[2025-11-13 09:59:25,988][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 31 and human policies 1.
[2025-11-13 09:59:25,988][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:59:35,226][__main__][INFO] - Number of regex retries in iteration 314: 0
[2025-11-13 09:59:35,227][__main__][INFO] - agents played in iteration 314 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:59:35,700][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:59:35,734][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:59:35,767][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:59:35,801][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:59:35,801][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:59:35,802][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:59:36,535][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 09:59:36,829][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 09:59:37,153][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 09:59:37,478][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 09:59:37,805][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 09:59:38,129][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 09:59:38,454][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 09:59:38,779][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 09:59:39,103][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 09:59:39,428][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 09:59:39,751][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 09:59:40,077][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 09:59:40,402][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 09:59:40,726][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 09:59:41,051][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 09:59:41,376][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 09:59:41,705][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 09:59:42,032][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 09:59:42,361][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 09:59:42,688][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 09:59:43,012][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 09:59:43,336][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 09:59:43,660][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 09:59:43,984][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 09:59:44,309][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 09:59:44,636][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 09:59:44,961][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 09:59:45,291][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 09:59:45,622][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 09:59:45,947][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 09:59:46,271][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 09:59:46,599][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 09:59:46,922][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 09:59:47,595][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 09:59:48,318][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:59:48,320][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:59:48,322][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:59:49,467][__main__][INFO] - Iteration 315 took 23s (39.34% Gen, 55.77% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 37m 59s. Estimated total time: 19h 33m 59s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 7s, 500 more iterations: 3h 15m 39s.
[2025-11-13 09:59:49,469][__main__][INFO] - Starting iteration 315.
[2025-11-13 09:59:49,473][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 31 and human policies 1.
[2025-11-13 09:59:49,473][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:59:58,946][__main__][INFO] - Number of regex retries in iteration 315: 0
[2025-11-13 09:59:58,947][__main__][INFO] - agents played in iteration 315 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 09:59:59,424][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:59:59,457][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:59:59,491][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:59:59,524][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:59:59,525][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:59:59,526][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:00:00,258][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:00:00,555][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:00:00,878][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:00:01,204][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:00:01,529][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:00:01,853][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:00:02,178][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:00:02,503][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:00:02,827][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:00:03,151][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:00:03,475][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:00:03,798][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:00:04,124][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:00:04,448][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:00:04,777][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:00:05,106][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:00:05,436][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:00:05,765][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:00:06,092][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:00:06,418][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:00:06,742][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:00:07,066][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:00:07,393][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:00:07,716][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:00:08,042][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:00:08,366][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:00:08,690][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:00:09,018][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:00:09,345][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:00:09,674][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:00:10,001][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:00:10,328][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:00:10,653][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:00:11,329][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:00:12,047][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:00:12,048][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:00:12,050][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:00:13,086][__main__][INFO] - Iteration 316 took 23s (40.12% Gen, 55.49% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 44m 19s. Estimated total time: 19h 40m 43s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 21s, 500 more iterations: 3h 16m 47s.
[2025-11-13 10:00:13,088][__main__][INFO] - Starting iteration 316.
[2025-11-13 10:00:13,092][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 31 and human policies 1.
[2025-11-13 10:00:13,093][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:00:21,884][__main__][INFO] - Number of regex retries in iteration 316: 0
[2025-11-13 10:00:21,885][__main__][INFO] - agents played in iteration 316 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:00:22,358][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:00:22,394][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:00:22,428][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:00:22,462][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:00:22,463][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:00:22,463][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:00:23,206][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:00:23,501][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:00:23,825][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:00:24,149][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:00:24,472][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:00:24,797][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:00:25,122][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:00:25,448][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:00:25,771][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:00:26,096][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:00:26,421][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:00:26,744][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:00:27,069][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:00:27,392][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:00:27,719][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:00:28,042][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:00:28,372][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:00:28,695][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:00:29,022][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:00:29,347][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:00:29,670][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:00:29,995][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:00:30,318][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:00:30,643][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:00:30,967][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:00:31,290][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:00:31,616][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:00:31,941][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:00:32,265][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:00:32,591][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:00:32,916][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:00:33,239][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:00:33,566][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:00:34,238][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:00:34,959][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:00:34,961][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:00:34,963][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:00:36,175][__main__][INFO] - Iteration 317 took 23s (38.09% Gen, 56.65% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 17m 24s. Estimated total time: 19h 14m 11s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 28s, 500 more iterations: 3h 12m 21s.
[2025-11-13 10:00:36,177][__main__][INFO] - Starting iteration 317.
[2025-11-13 10:00:36,180][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 31 and human policies 1.
[2025-11-13 10:00:36,181][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:00:45,010][__main__][INFO] - Number of regex retries in iteration 317: 0
[2025-11-13 10:00:45,010][__main__][INFO] - agents played in iteration 317 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:00:45,480][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:00:45,516][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:00:45,549][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:00:45,582][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:00:45,583][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:00:45,584][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:00:46,331][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:00:46,626][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:00:46,949][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:00:47,273][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:00:47,596][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:00:47,920][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:00:48,245][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:00:48,569][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:00:48,894][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:00:49,217][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:00:49,541][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:00:49,866][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:00:50,189][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:00:50,515][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:00:50,839][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:00:51,165][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:00:51,493][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:00:51,821][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:00:52,147][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:00:52,471][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:00:52,799][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:00:53,126][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:00:53,450][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:00:53,774][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:00:54,099][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:00:54,423][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:00:54,748][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:00:55,073][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:00:55,396][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:00:55,722][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:00:56,048][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:00:56,373][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:00:56,700][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:00:57,373][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:00:58,101][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:00:58,102][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:00:58,104][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:00:59,074][__main__][INFO] - Iteration 318 took 22s (38.57% Gen, 57.19% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 7m 34s. Estimated total time: 19h 4m 43s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 9s, 500 more iterations: 3h 10m 47s.
[2025-11-13 10:00:59,076][__main__][INFO] - Starting iteration 318.
[2025-11-13 10:00:59,080][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 31 and human policies 1.
[2025-11-13 10:00:59,080][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:01:07,793][__main__][INFO] - Number of regex retries in iteration 318: 0
[2025-11-13 10:01:07,794][__main__][INFO] - agents played in iteration 318 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:01:08,263][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:01:08,298][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:01:08,332][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:01:08,366][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:01:08,366][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:01:08,367][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:01:09,105][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:01:09,401][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:01:09,726][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:01:10,052][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:01:10,375][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:01:10,700][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:01:11,025][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:01:11,350][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:01:11,676][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:01:12,000][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:01:12,325][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:01:12,651][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:01:12,976][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:01:13,301][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:01:13,626][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:01:13,950][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:01:14,274][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:01:14,599][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:01:14,926][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:01:15,250][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:01:15,575][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:01:15,898][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:01:16,223][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:01:16,546][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:01:16,869][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:01:17,193][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:01:17,523][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:01:17,847][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:01:18,174][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:01:18,504][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:01:18,834][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:01:19,163][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:01:19,488][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:01:20,189][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:01:20,904][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:01:20,906][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:01:20,908][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:01:21,867][__main__][INFO] - Iteration 319 took 22s (38.24% Gen, 57.55% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 1m 53s. Estimated total time: 18h 59m 26s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 58s, 500 more iterations: 3h 9m 54s.
[2025-11-13 10:01:21,869][__main__][INFO] - Starting iteration 319.
[2025-11-13 10:01:21,872][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 31 and human policies 1.
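The iteration-summary lines above report a per-iteration duration plus extrapolated remaining/total time and projections for 10/100/500 more iterations. A minimal sketch of how such estimates can be derived from the mean iteration duration (hypothetical helpers, not the actual mllm implementation):

```python
def format_hms(seconds: float) -> str:
    """Render a duration as 'Hh Mm Ss', like the log's time estimates."""
    s = int(round(seconds))
    h, rem = divmod(s, 3600)
    m, s = divmod(rem, 60)
    return f"{h}h {m}m {s}s"


def estimate_remaining(avg_iter_seconds: float, iters_done: int, iters_total: int) -> float:
    """Extrapolate remaining wall-clock time from the mean iteration duration."""
    return avg_iter_seconds * (iters_total - iters_done)


# Example: ~22 s/iteration with 100 iterations left.
remaining = estimate_remaining(22.0, iters_done=319, iters_total=419)
print(format_hms(remaining))  # -> 0h 36m 40s
```

The log's "N more iterations" projections are just this product evaluated at N = 10, 100, 500.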
[2025-11-13 10:01:21,873][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:01:30,455][__main__][INFO] - Number of regex retries in iteration 319: 0
[2025-11-13 10:01:30,456][__main__][INFO] - agents played in iteration 319 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:01:30,926][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:01:30,960][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:01:30,994][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:01:31,027][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:01:31,028][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:01:31,028][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:01:31,763][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:01:32,059][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:01:32,383][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:01:32,706][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:01:33,032][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:01:33,355][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:01:33,679][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:01:34,003][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:01:34,327][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:01:34,651][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:01:34,976][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:01:35,301][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:01:35,626][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:01:35,949][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:01:36,274][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:01:36,599][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:01:36,923][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:01:37,247][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:01:37,572][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:01:37,898][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:01:38,222][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:01:38,547][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:01:38,871][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:01:39,195][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:01:39,521][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:01:39,845][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:01:40,170][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:01:40,493][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:01:40,817][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:01:41,145][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:01:41,469][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:01:41,793][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:01:42,117][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:01:42,851][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:01:43,597][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:01:43,599][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:01:43,601][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:01:44,580][__main__][INFO] - Iteration 320 took 22s (37.80% Gen, 57.89% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 57m 30s. Estimated total time: 18h 55m 25s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 50s, 500 more iterations: 3h 9m 14s.
[2025-11-13 10:01:44,582][__main__][INFO] - Starting iteration 320.
[2025-11-13 10:01:44,585][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 31 and human policies 1.
[2025-11-13 10:01:44,586][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:01:53,084][__main__][INFO] - Number of regex retries in iteration 320: 0
[2025-11-13 10:01:53,085][__main__][INFO] - agents played in iteration 320 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:01:53,542][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:01:53,575][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:01:53,610][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:01:53,644][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:01:53,645][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:01:53,645][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:01:54,377][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:01:54,673][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:01:54,997][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:01:55,322][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:01:55,649][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:01:55,977][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:01:56,304][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:01:56,629][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:01:56,954][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:01:57,278][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:01:57,604][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:01:57,930][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:01:58,255][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:01:58,579][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:01:58,905][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:01:59,231][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:01:59,554][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:01:59,878][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:02:00,204][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:02:00,528][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:02:00,852][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:02:01,176][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:02:01,499][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:02:01,825][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:02:02,152][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:02:02,476][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:02:02,805][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:02:03,133][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:02:03,459][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:02:03,783][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:02:04,108][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:02:04,433][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:02:04,759][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:02:05,490][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:02:06,227][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:02:06,229][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:02:06,231][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:02:08,112][__main__][INFO] - Iteration 321 took 23s (36.12% Gen, 55.87% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 38m 2s. Estimated total time: 19h 36m 21s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 12s, 500 more iterations: 3h 16m 3s.
[2025-11-13 10:02:08,114][__main__][INFO] - Starting iteration 321.
[2025-11-13 10:02:08,117][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 32 and human policies 1.
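Each "For task: ..." line reports a block's memory footprint as three percentages of device VRAM: the net change across the block (ΔVRAM %), the amount held afterwards (Current %), and the block's peak (Block Peak %). A hedged sketch of the arithmetic behind those fields; in a real trainer the byte counts would presumably come from something like `torch.cuda.memory_allocated()` / `max_memory_allocated()` (an assumption about the implementation, not confirmed by the log):

```python
def vram_report(before_bytes: int, after_bytes: int,
                peak_bytes: int, total_bytes: int) -> dict:
    """Summarize a code block's memory footprint as percentages of device
    VRAM, mirroring the log's ΔVRAM % / Current % / Block Peak % fields.
    Byte counts are hypothetical inputs; on CUDA they might be sampled
    before/after the block (assumption, not the mllm implementation)."""
    pct = lambda b: 100.0 * b / total_bytes
    return {
        "delta_pct": pct(after_bytes - before_bytes),    # ΔVRAM % (total)
        "current_pct": pct(after_bytes),                 # Current % of VRAM taken
        "peak_pct": pct(peak_bytes),                     # Block Peak % of device VRAM
    }


# Toy numbers on a 100-byte "device": matches the log's 2.51%-style deltas in spirit.
print(vram_report(before_bytes=40, after_bytes=42, peak_bytes=26, total_bytes=100))
```

Note that ΔVRAM can be 0.00% while the block's peak is large: memory allocated inside the block and freed before it returns shows up only in the peak column, which is exactly the pattern the advantage-computation lines exhibit.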
[2025-11-13 10:02:08,118][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:02:17,548][__main__][INFO] - Number of regex retries in iteration 321: 0
[2025-11-13 10:02:17,549][__main__][INFO] - agents played in iteration 321 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:02:18,011][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:02:18,044][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:02:18,077][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:02:18,110][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:02:18,110][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:02:18,111][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:02:18,849][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:02:19,145][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:02:19,470][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:02:19,796][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:02:20,120][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:02:20,445][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:02:20,769][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:02:21,095][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:02:21,418][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:02:21,744][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:02:22,070][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:02:22,394][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:02:22,718][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:02:23,043][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:02:23,367][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:02:23,692][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:02:24,017][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:02:24,341][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:02:24,667][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:02:24,990][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:02:25,315][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:02:25,641][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:02:25,965][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:02:26,292][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:02:26,620][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:02:26,944][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:02:27,268][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:02:27,597][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:02:27,924][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:02:28,248][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:02:28,576][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:02:28,900][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:02:29,226][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:02:29,986][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:02:30,721][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:02:30,723][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:02:30,724][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:02:31,735][__main__][INFO] - Iteration 322 took 23s (39.93% Gen, 55.78% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 42m 12s. Estimated total time: 19h 40m 55s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 21s, 500 more iterations: 3h 16m 49s.
[2025-11-13 10:02:31,737][__main__][INFO] - Starting iteration 322.
[2025-11-13 10:02:31,740][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 32 and human policies 1.
[2025-11-13 10:02:31,741][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:02:40,888][__main__][INFO] - Number of regex retries in iteration 322: 0
[2025-11-13 10:02:40,889][__main__][INFO] - agents played in iteration 322 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:02:41,362][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:02:41,411][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:02:41,448][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:02:41,482][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:02:41,482][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:02:41,483][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:02:42,213][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:02:42,509][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:02:42,837][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:02:43,167][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:02:43,494][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:02:43,820][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:02:44,144][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:02:44,469][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:02:44,792][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:02:45,117][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:02:45,441][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:02:45,767][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:02:46,090][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:02:46,413][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:02:46,736][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:02:47,059][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:02:47,385][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:02:47,709][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:02:48,033][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:02:48,358][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:02:48,682][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:02:49,006][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:02:49,329][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:02:49,654][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:02:49,978][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:02:50,302][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:02:50,626][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:02:50,950][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:02:51,275][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:02:51,600][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:02:51,922][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:02:52,247][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:02:52,572][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:02:53,278][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:02:54,006][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:02:54,008][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:02:54,010][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:02:54,972][__main__][INFO] - Iteration 323 took 23s (39.38% Gen, 56.47% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 22m 32s. Estimated total time: 19h 21m 38s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 43s, 500 more iterations: 3h 13m 36s.
[2025-11-13 10:02:54,974][__main__][INFO] - Starting iteration 323.
[2025-11-13 10:02:54,978][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 32 and human policies 1.
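Each training phase above walks 128 mini-batches, logging progress every 4, accumulates the policy-gradient loss, and only then applies a single "reinforce" optimizer step. A hedged plain-Python sketch of that accumulate-then-step pattern (hypothetical function and parameter names; the real mllm trainer presumably works on tensors and an optimizer object):

```python
def accumulate_and_step(minibatches, grad_fn, apply_step, log_every=4):
    """Accumulate gradients over all mini-batches, then apply ONE update,
    mirroring the log's pattern: 128 'Processing mini-batch i of N' lines
    (every `log_every` batches) followed by a single 'Apply reinforce step'.

    grad_fn(mb) stands in for a per-mini-batch gradient computation;
    apply_step(g) stands in for the optimizer update (both hypothetical)."""
    total = len(minibatches)
    accum = 0.0
    for i, mb in enumerate(minibatches):
        if i % log_every == 0:
            print(f"Processing mini-batch {i} of {total}")
        accum += grad_fn(mb) / total  # average contribution across mini-batches
    apply_step(accum)  # one optimizer step per iteration
    return accum


# Toy usage: two "mini-batches", gradient = sum of elements.
steps = []
g = accumulate_and_step([[1, 2], [3, 4]], grad_fn=sum,
                        apply_step=steps.append, log_every=1)
print(g)  # -> 5.0
```

Accumulating before stepping is what lets a 128-mini-batch effective batch fit in the modest per-block VRAM peaks the log reports: only one mini-batch's activations are live at a time, while gradients are summed in place.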
[2025-11-13 10:02:54,978][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:03:03,513][__main__][INFO] - Number of regex retries in iteration 323: 0
[2025-11-13 10:03:03,513][__main__][INFO] - agents played in iteration 323 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:03:03,982][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:03:04,017][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:03:04,049][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:03:04,082][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:03:04,082][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:03:04,083][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:03:04,788][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:03:05,082][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:03:05,411][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:03:05,736][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:03:06,061][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:03:06,387][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:03:06,713][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:03:07,042][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:03:07,374][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:03:07,707][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:03:08,032][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:03:08,356][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:03:08,680][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:03:09,005][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:03:09,330][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:03:09,653][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:03:09,977][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:03:10,302][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:03:10,626][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:03:10,950][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:03:11,275][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:03:11,601][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:03:11,925][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:03:12,249][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:03:12,574][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:03:12,898][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:03:13,222][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:03:13,549][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:03:13,875][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:03:14,201][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:03:14,525][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:03:14,848][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:03:15,171][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:03:15,862][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:03:16,608][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:03:16,610][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:03:16,611][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:03:17,633][__main__][INFO] - Iteration 324 took 22s (37.67% Gen, 57.81% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 53m 21s. Estimated total time: 18h 52m 49s. Time estimates for 10 more iterations: 3m 46s, 100 more iterations: 37m 45s, 500 more iterations: 3h 8m 48s.
[2025-11-13 10:03:17,636][__main__][INFO] - Starting iteration 324.
[2025-11-13 10:03:17,639][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 32 and human policies 1.
[2025-11-13 10:03:17,639][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:03:25,928][__main__][INFO] - Number of regex retries in iteration 324: 0
[2025-11-13 10:03:25,929][__main__][INFO] - agents played in iteration 324 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:03:26,392][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:03:26,763][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:03:26,795][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:03:26,828][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:03:26,828][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:03:26,829][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:03:27,523][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:03:27,817][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:03:28,143][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:03:28,466][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:03:28,795][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:03:29,122][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:03:29,449][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:03:29,773][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:03:30,100][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:03:30,424][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:03:30,753][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:03:31,080][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:03:31,411][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:03:31,735][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:03:32,059][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:03:32,382][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:03:32,706][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:03:33,030][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:03:33,353][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:03:33,677][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:03:34,001][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:03:34,325][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:03:34,649][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:03:34,972][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:03:35,297][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:03:35,621][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:03:35,947][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:03:36,272][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:03:36,596][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:03:36,921][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:03:37,249][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:03:37,574][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:03:37,899][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:03:38,602][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:03:39,330][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:03:39,334][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:03:39,336][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:03:40,428][__main__][INFO] - Iteration 325 took 22s (36.37% Gen, 58.83% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 59m 37s. Estimated total time: 18h 59m 29s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 58s, 500 more iterations: 3h 9m 54s.
[2025-11-13 10:03:40,430][__main__][INFO] - Starting iteration 325.
[2025-11-13 10:03:40,434][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 32 and human policies 1.
[2025-11-13 10:03:40,434][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:03:49,303][__main__][INFO] - Number of regex retries in iteration 325: 0
[2025-11-13 10:03:49,304][__main__][INFO] - agents played in iteration 325 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:03:49,767][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:03:49,800][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:03:49,832][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:03:49,864][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:03:49,865][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:03:49,865][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:03:50,558][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:03:50,851][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:03:51,176][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:03:51,499][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:03:51,824][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:03:52,148][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:03:52,475][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:03:52,799][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:03:53,126][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:03:53,454][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:03:53,780][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:03:54,106][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:03:54,437][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:03:54,765][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:03:55,089][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:03:55,414][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:03:55,739][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:03:56,062][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:03:56,387][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:03:56,710][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:03:57,033][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:03:57,359][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:03:57,683][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:03:58,008][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:03:58,333][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:03:58,657][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:03:58,981][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:03:59,306][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:03:59,631][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:03:59,954][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:04:00,278][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:04:00,601][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:04:00,926][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:04:01,629][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:04:02,349][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:04:02,351][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:04:02,353][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:04:03,437][__main__][INFO] - Iteration 326 took 23s (38.56% Gen, 56.72% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 9m 59s. Estimated total time: 19h 10m 13s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 20s, 500 more iterations: 3h 11m 42s.
[2025-11-13 10:04:03,439][__main__][INFO] - Starting iteration 326.
[2025-11-13 10:04:03,442][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 32 and human policies 1.
[2025-11-13 10:04:03,443][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:04:12,460][__main__][INFO] - Number of regex retries in iteration 326: 0
[2025-11-13 10:04:12,460][__main__][INFO] - agents played in iteration 326 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:04:12,923][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:04:12,956][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:04:12,990][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:04:13,023][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:04:13,024][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:04:13,024][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:04:13,728][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:04:14,022][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:04:14,347][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:04:14,674][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:04:14,997][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:04:15,324][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:04:15,649][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:04:15,974][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:04:16,297][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:04:16,620][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:04:16,948][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:04:17,272][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:04:17,597][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:04:17,924][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:04:18,250][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:04:18,575][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:04:18,898][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:04:19,222][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:04:19,547][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:04:19,872][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:04:20,195][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:04:20,520][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:04:20,844][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:04:21,168][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:04:21,494][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:04:21,817][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:04:22,143][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:04:22,468][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:04:22,792][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:04:23,116][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:04:23,441][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:04:23,765][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:04:24,090][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:04:24,795][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:04:25,532][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:04:25,533][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:04:25,535][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:04:26,516][__main__][INFO] - Iteration 327 took 23s (39.08% Gen, 56.66% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 13m 6s. Estimated total time: 19h 13m 44s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 27s, 500 more iterations: 3h 12m 17s.
[2025-11-13 10:04:26,518][__main__][INFO] - Starting iteration 327.
[2025-11-13 10:04:26,521][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 32 and human policies 1.
[2025-11-13 10:04:26,522][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:04:35,085][__main__][INFO] - Number of regex retries in iteration 327: 0
[2025-11-13 10:04:35,086][__main__][INFO] - agents played in iteration 327 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:04:35,557][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:04:35,590][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:04:35,623][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:04:35,656][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:04:35,657][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:04:35,658][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:04:36,394][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:04:36,688][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:04:37,013][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:04:37,336][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:04:37,659][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:04:37,983][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:04:38,308][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:04:38,631][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:04:38,955][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:04:39,281][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:04:39,604][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:04:39,929][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:04:40,254][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:04:40,579][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:04:40,903][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:04:41,228][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:04:41,554][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:04:41,881][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:04:42,207][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:04:42,530][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:04:42,855][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:04:43,180][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:04:43,504][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:04:43,827][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:04:44,151][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:04:44,474][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:04:44,798][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:04:45,124][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:04:45,447][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:04:45,771][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:04:46,097][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:04:46,421][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:04:46,745][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:04:47,467][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:04:48,169][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:04:48,171][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:04:48,173][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:04:49,139][__main__][INFO] - Iteration 328 took 22s (37.86% Gen, 57.86% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 49m 56s. Estimated total time: 18h 50m 56s. Time estimates for 10 more iterations: 3m 46s, 100 more iterations: 37m 41s, 500 more iterations: 3h 8m 29s.
[2025-11-13 10:04:49,142][__main__][INFO] - Starting iteration 328.
[2025-11-13 10:04:49,144][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 32 and human policies 1.
[2025-11-13 10:04:49,145][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:04:57,791][__main__][INFO] - Number of regex retries in iteration 328: 0
[2025-11-13 10:04:57,791][__main__][INFO] - agents played in iteration 328 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:04:58,258][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:04:58,291][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:04:58,324][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:04:58,356][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:04:58,357][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:04:58,357][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:04:59,088][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:04:59,383][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:04:59,707][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:05:00,032][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:05:00,355][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:05:00,678][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:05:01,002][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:05:01,328][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:05:01,653][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:05:01,977][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:05:02,300][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:05:02,626][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:05:02,949][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:05:03,275][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:05:03,600][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:05:03,926][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:05:04,250][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:05:04,576][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:05:04,901][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:05:05,226][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:05:05,551][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:05:05,876][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:05:06,202][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:05:06,527][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:05:06,852][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:05:07,177][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:05:07,502][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:05:07,827][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:05:08,152][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:05:08,476][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:05:08,801][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:05:09,126][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:05:09,450][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:05:10,164][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:05:10,923][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:05:10,924][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:05:10,926][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:05:11,907][__main__][INFO] - Iteration 329 took 22s (37.98% Gen, 57.70% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 56m 48s. Estimated total time: 18h 58m 11s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 56s, 500 more iterations: 3h 9m 41s.
[2025-11-13 10:05:11,909][__main__][INFO] - Starting iteration 329.
[2025-11-13 10:05:11,912][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 32 and human policies 1.
[2025-11-13 10:05:11,913][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:05:21,265][__main__][INFO] - Number of regex retries in iteration 329: 0
[2025-11-13 10:05:21,265][__main__][INFO] - agents played in iteration 329 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:05:21,745][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:05:21,778][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:05:21,811][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:05:21,844][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:05:21,845][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:05:21,845][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:05:22,567][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:05:22,864][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:05:23,188][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:05:23,511][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:05:23,834][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:05:24,158][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:05:24,483][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:05:24,809][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:05:25,133][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:05:25,458][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:05:25,782][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:05:26,107][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:05:26,431][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:05:26,758][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:05:27,086][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:05:27,415][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:05:27,741][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:05:28,069][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:05:28,394][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:05:28,719][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:05:29,046][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:05:29,371][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:05:29,695][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:05:30,020][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:05:30,346][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:05:30,670][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:05:30,994][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:05:31,318][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:05:31,643][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:05:31,967][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:05:32,292][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:05:32,615][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:05:32,940][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:05:33,656][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:05:34,366][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:05:34,367][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:05:34,369][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:05:35,391][__main__][INFO] - Iteration 330 took 23s (39.83% Gen, 55.81% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 32m 14s. Estimated total time: 19h 34m 1s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 8s, 500 more iterations: 3h 15m 40s.
[2025-11-13 10:05:35,393][__main__][INFO] - Starting iteration 330.
[2025-11-13 10:05:35,397][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 32 and human policies 1.
[2025-11-13 10:05:35,397][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:05:44,358][__main__][INFO] - Number of regex retries in iteration 330: 0
[2025-11-13 10:05:44,359][__main__][INFO] - agents played in iteration 330 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:05:44,825][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:05:45,185][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:05:45,218][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:05:45,252][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:05:45,252][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:05:45,253][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:05:45,975][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:05:46,270][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:05:46,593][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:05:46,917][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:05:47,240][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:05:47,565][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:05:47,891][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:05:48,215][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:05:48,539][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:05:48,863][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:05:49,188][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:05:49,513][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:05:49,837][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:05:50,164][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:05:50,494][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:05:50,820][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:05:51,146][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:05:51,475][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:05:51,801][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:05:52,127][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:05:52,451][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:05:52,775][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:05:53,098][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:05:53,422][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:05:53,747][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:05:54,072][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:05:54,396][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:05:54,721][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:05:55,046][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:05:55,370][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:05:55,695][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:05:56,020][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:05:56,346][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:05:57,067][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:05:57,797][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:05:57,800][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:05:57,802][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:05:59,762][__main__][INFO] - Iteration 331 took 24s (36.78% Gen, 55.17% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 16m 8s. Estimated total time: 20h 18m 18s. Time estimates for 10 more iterations: 4m 3s, 100 more iterations: 40m 36s, 500 more iterations: 3h 23m 3s.
[2025-11-13 10:05:59,764][__main__][INFO] - Starting iteration 331.
[2025-11-13 10:05:59,767][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 33 and human policies 1.
[2025-11-13 10:05:59,768][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:06:09,099][__main__][INFO] - Number of regex retries in iteration 331: 0
[2025-11-13 10:06:09,099][__main__][INFO] - agents played in iteration 331 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:06:09,565][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:06:09,598][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:06:09,631][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:06:09,988][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:06:09,988][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:06:09,988][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:06:10,681][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:06:10,975][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:06:11,301][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:06:11,630][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:06:11,956][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:06:12,284][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:06:12,611][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:06:12,934][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:06:13,261][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:06:13,591][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:06:13,917][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:06:14,242][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:06:14,568][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:06:14,893][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:06:15,219][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:06:15,545][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:06:15,872][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:06:16,198][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:06:16,525][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:06:16,850][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:06:17,176][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:06:17,500][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:06:17,824][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:06:18,149][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:06:18,474][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:06:18,799][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:06:19,123][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:06:19,447][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:06:19,771][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:06:20,095][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:06:20,420][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:06:20,745][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:06:21,070][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:06:21,803][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:06:22,526][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:06:22,528][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:06:22,530][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:06:23,483][__main__][INFO] - Iteration 332 took 23s (39.35% Gen, 56.63% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 43m 16s. Estimated total time: 19h 45m 50s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 31s, 500 more iterations: 3h 17m 38s.
[2025-11-13 10:06:23,485][__main__][INFO] - Starting iteration 332.
[2025-11-13 10:06:23,488][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 33 and human policies 1.
[2025-11-13 10:06:23,489][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:06:32,224][__main__][INFO] - Number of regex retries in iteration 332: 0
[2025-11-13 10:06:32,225][__main__][INFO] - agents played in iteration 332 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:06:32,691][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:06:32,724][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:06:32,757][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:06:32,790][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:06:32,791][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:06:32,791][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:06:33,514][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:06:33,810][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:06:34,136][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:06:34,461][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:06:34,789][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:06:35,114][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:06:35,441][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:06:35,763][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:06:36,087][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:06:36,410][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:06:36,734][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:06:37,059][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:06:37,384][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:06:37,710][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:06:38,034][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:06:38,359][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:06:38,686][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:06:39,018][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:06:39,346][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:06:39,675][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:06:40,005][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:06:40,331][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:06:40,655][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:06:40,980][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:06:41,306][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:06:41,631][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:06:41,955][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:06:42,280][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:06:42,606][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:06:42,931][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:06:43,255][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:06:43,580][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:06:43,906][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:06:44,617][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:06:45,314][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:06:45,315][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:06:45,317][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:06:46,268][__main__][INFO] - Iteration 333 took 22s (38.35% Gen, 57.47% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 56m 4s. Estimated total time: 18h 59m 1s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 58s, 500 more iterations: 3h 9m 50s.
[2025-11-13 10:06:46,270][__main__][INFO] - Starting iteration 333.
[2025-11-13 10:06:46,273][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 33 and human policies 1.
[2025-11-13 10:06:46,273][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:06:54,986][__main__][INFO] - Number of regex retries in iteration 333: 0
[2025-11-13 10:06:54,986][__main__][INFO] - agents played in iteration 333 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:06:55,455][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:06:55,820][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:06:55,853][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:06:55,886][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:06:55,887][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:06:55,887][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:06:56,603][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:06:56,896][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:06:57,221][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:06:57,544][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:06:57,866][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:06:58,189][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:06:58,515][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:06:58,841][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:06:59,165][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:06:59,495][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:06:59,823][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:07:00,150][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:07:00,477][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:07:00,804][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:07:01,132][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:07:01,459][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:07:01,784][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:07:02,108][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:07:02,434][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:07:02,761][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:07:03,087][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:07:03,414][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:07:03,738][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:07:04,063][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:07:04,387][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:07:04,713][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:07:05,037][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:07:05,361][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:07:05,685][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:07:06,009][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:07:06,333][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:07:06,657][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:07:06,986][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:07:07,720][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:07:08,411][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:07:08,413][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:07:08,414][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:07:09,542][__main__][INFO] - Iteration 334 took 23s (37.44% Gen, 57.70% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 20m 10s. Estimated total time: 19h 23m 31s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 47s, 500 more iterations: 3h 13m 55s.
[2025-11-13 10:07:09,544][__main__][INFO] - Starting iteration 334.
[2025-11-13 10:07:09,548][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 33 and human policies 1.
[2025-11-13 10:07:09,549][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:07:18,070][__main__][INFO] - Number of regex retries in iteration 334: 0
[2025-11-13 10:07:18,070][__main__][INFO] - agents played in iteration 334 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:07:18,536][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:07:18,569][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:07:18,602][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:07:18,636][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:07:18,636][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:07:18,637][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:07:19,351][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:07:19,645][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:07:19,970][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:07:20,294][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:07:20,620][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:07:20,949][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:07:21,272][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:07:21,597][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:07:21,920][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:07:22,244][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:07:22,572][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:07:22,898][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:07:23,227][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:07:23,554][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:07:23,877][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:07:24,203][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:07:24,528][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:07:24,853][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:07:25,183][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:07:25,510][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:07:25,835][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:07:26,160][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:07:26,489][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:07:26,814][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:07:27,140][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:07:27,466][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:07:27,791][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:07:28,115][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:07:28,439][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:07:28,764][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:07:29,089][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:07:29,412][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:07:29,737][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:07:30,448][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:07:31,154][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:07:31,155][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:07:31,157][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:07:32,385][__main__][INFO] - Iteration 335 took 22s (37.31% Gen, 57.31% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 58m 10s. Estimated total time: 19h 1m 53s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 3s, 500 more iterations: 3h 10m 18s.
[2025-11-13 10:07:32,387][__main__][INFO] - Starting iteration 335.
[2025-11-13 10:07:32,391][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 33 and human policies 1.
[2025-11-13 10:07:32,391][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:07:41,365][__main__][INFO] - Number of regex retries in iteration 335: 0
[2025-11-13 10:07:41,366][__main__][INFO] - agents played in iteration 335 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:07:41,832][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:07:41,865][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:07:41,898][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:07:41,932][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:07:41,932][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:07:41,933][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:07:42,653][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:07:42,948][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:07:43,273][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:07:43,596][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:07:43,919][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:07:44,244][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:07:44,571][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:07:44,896][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:07:45,222][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:07:45,546][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:07:45,871][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:07:46,199][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:07:46,525][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:07:46,849][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:07:47,175][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:07:47,500][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:07:47,825][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:07:48,153][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:07:48,483][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:07:48,809][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:07:49,133][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:07:49,461][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:07:49,791][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:07:50,117][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:07:50,441][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:07:50,768][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:07:51,093][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:07:51,418][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:07:51,742][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:07:52,067][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:07:52,392][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:07:52,717][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:07:53,041][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
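The "Processing mini-batch k of 128" records appear only every 4 mini-batches, consistent with a gradient-accumulation loop that logs at a fixed interval. A minimal sketch of that logging pattern, assuming the interval and batch count from the log (the function name and per-batch token count are hypothetical; 30 tokens per mini-batch is just 3840 / 128):

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("mllm.training.trainer_common")

def accumulate_policy_loss(num_mini_batches: int = 128, log_every: int = 4) -> int:
    """Walk the mini-batches, logging progress every `log_every` steps,
    and report the total token count at the end (loss math omitted)."""
    tokens_accumulated = 0
    for i in range(num_mini_batches):
        if i % log_every == 0:
            logger.info("Processing mini-batch %d of %d", i, num_mini_batches)
        tokens_accumulated += 30  # placeholder per-batch token count
    logger.info("Accumulated the policy gradient loss for %d tokens.", tokens_accumulated)
    return tokens_accumulated
```

The interval keeps the log readable at 128 mini-batches per iteration without hiding progress entirely.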
[2025-11-13 10:07:53,759][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:07:54,448][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:07:54,452][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:07:54,454][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:07:55,345][__main__][INFO] - Iteration 336 took 22s (39.09% Gen, 57.02% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 3m 40s. Estimated total time: 19h 7m 46s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 15s, 500 more iterations: 3h 11m 17s.
[2025-11-13 10:07:55,347][__main__][INFO] - Starting iteration 336.
[2025-11-13 10:07:55,350][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 33 and human policies 1.
[2025-11-13 10:07:55,351][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:08:04,700][__main__][INFO] - Number of regex retries in iteration 336: 0
[2025-11-13 10:08:04,700][__main__][INFO] - agents played in iteration 336 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:08:05,167][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:08:05,201][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:08:05,234][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:08:05,267][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:08:05,268][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:08:05,268][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:08:05,996][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:08:06,292][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:08:06,618][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:08:06,941][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:08:07,263][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:08:07,588][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:08:07,913][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:08:08,236][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:08:08,559][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:08:08,883][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:08:09,207][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:08:09,531][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:08:09,857][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:08:10,185][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:08:10,510][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:08:10,834][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:08:11,158][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:08:11,485][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:08:11,810][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:08:12,134][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:08:12,458][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:08:12,781][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:08:13,106][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:08:13,434][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:08:13,760][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:08:14,086][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:08:14,411][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:08:14,736][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:08:15,062][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:08:15,387][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:08:15,710][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:08:16,034][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:08:16,358][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:08:17,078][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:08:17,762][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:08:17,764][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:08:17,765][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:08:18,641][__main__][INFO] - Iteration 337 took 23s (40.14% Gen, 56.09% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 20m 4s. Estimated total time: 19h 24m 34s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 49s, 500 more iterations: 3h 14m 5s.
[2025-11-13 10:08:18,643][__main__][INFO] - Starting iteration 337.
[2025-11-13 10:08:18,645][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 33 and human policies 1.
[2025-11-13 10:08:18,646][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:08:28,061][__main__][INFO] - Number of regex retries in iteration 337: 0
[2025-11-13 10:08:28,062][__main__][INFO] - agents played in iteration 337 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:08:28,542][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:08:28,575][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:08:28,607][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:08:28,641][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:08:28,641][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:08:28,642][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:08:29,380][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:08:29,674][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:08:29,997][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:08:30,320][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:08:30,648][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:08:30,974][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:08:31,300][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:08:31,628][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:08:31,950][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:08:32,275][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:08:32,598][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:08:32,924][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:08:33,250][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:08:33,577][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:08:33,907][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:08:34,232][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:08:34,557][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:08:34,884][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:08:35,209][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:08:35,532][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:08:35,855][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:08:36,179][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:08:36,503][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:08:36,829][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:08:37,154][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:08:37,477][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:08:37,801][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:08:38,126][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:08:38,450][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:08:38,776][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:08:39,100][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:08:39,426][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:08:39,750][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:08:40,468][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:08:41,161][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:08:41,163][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:08:41,164][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:08:42,096][__main__][INFO] - Iteration 338 took 23s (40.15% Gen, 55.87% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 27m 42s. Estimated total time: 19h 32m 35s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 5s, 500 more iterations: 3h 15m 25s.
[2025-11-13 10:08:42,098][__main__][INFO] - Starting iteration 338.
[2025-11-13 10:08:42,100][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 33 and human policies 1.
[2025-11-13 10:08:42,101][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:08:51,182][__main__][INFO] - Number of regex retries in iteration 338: 0
[2025-11-13 10:08:51,183][__main__][INFO] - agents played in iteration 338 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:08:51,648][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:08:51,682][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:08:51,715][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:08:51,749][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:08:51,750][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:08:51,750][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:08:52,477][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:08:52,771][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:08:53,096][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:08:53,423][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:08:53,747][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:08:54,074][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:08:54,402][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:08:54,725][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:08:55,049][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:08:55,372][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:08:55,696][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:08:56,020][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:08:56,343][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:08:56,667][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:08:56,992][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:08:57,317][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:08:57,641][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:08:57,967][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:08:58,290][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:08:58,613][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:08:58,937][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:08:59,267][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:08:59,596][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:08:59,924][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:09:00,248][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:09:00,574][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:09:00,898][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:09:01,223][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:09:01,548][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:09:01,872][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:09:02,197][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:09:02,522][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:09:02,846][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:09:03,562][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:09:04,256][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:09:04,257][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:09:04,259][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:09:05,392][__main__][INFO] - Iteration 339 took 23s (38.99% Gen, 56.14% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 19m 20s. Estimated total time: 19h 24m 36s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 49s, 500 more iterations: 3h 14m 6s.
[2025-11-13 10:09:05,394][__main__][INFO] - Starting iteration 339.
[2025-11-13 10:09:05,397][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 33 and human policies 1.
[2025-11-13 10:09:05,398][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:09:14,835][__main__][INFO] - Number of regex retries in iteration 339: 0
[2025-11-13 10:09:14,836][__main__][INFO] - agents played in iteration 339 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:09:15,299][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:09:15,332][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:09:15,365][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:09:15,399][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:09:15,399][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:09:15,400][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:09:16,113][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:09:16,407][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:09:16,732][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:09:17,057][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:09:17,381][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:09:17,705][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:09:18,028][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:09:18,351][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:09:18,679][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:09:19,004][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:09:19,333][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:09:19,657][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:09:19,986][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:09:20,312][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:09:20,641][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:09:20,966][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:09:21,293][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:09:21,622][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:09:21,948][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:09:22,272][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:09:22,601][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:09:22,929][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:09:23,258][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:09:23,588][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:09:23,914][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:09:24,240][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:09:24,565][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:09:24,890][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:09:25,214][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:09:25,540][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:09:25,864][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:09:26,189][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:09:26,515][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:09:27,233][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:09:27,931][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:09:27,932][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:09:27,934][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:09:28,920][__main__][INFO] - Iteration 340 took 23s (40.12% Gen, 55.68% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 30m 31s. Estimated total time: 19h 36m 11s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 12s, 500 more iterations: 3h 16m 1s.
[2025-11-13 10:09:28,922][__main__][INFO] - Starting iteration 340.
[2025-11-13 10:09:28,924][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 33 and human policies 1.
[2025-11-13 10:09:28,925][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:09:38,170][__main__][INFO] - Number of regex retries in iteration 340: 0
[2025-11-13 10:09:38,170][__main__][INFO] - agents played in iteration 340 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:09:38,649][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:09:38,686][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:09:38,719][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:09:38,752][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:09:38,753][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:09:38,753][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:09:39,445][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:09:39,740][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:09:40,064][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:09:40,389][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:09:40,712][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:09:41,036][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:09:41,361][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:09:41,690][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:09:42,017][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:09:42,345][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:09:42,670][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:09:42,995][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:09:43,319][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:09:43,644][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:09:43,970][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:09:44,295][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:09:44,617][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:09:44,941][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:09:45,266][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:09:45,592][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:09:45,919][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:09:46,248][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:09:46,572][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:09:46,898][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:09:47,223][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:09:47,548][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:09:47,876][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:09:48,200][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:09:48,525][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:09:48,849][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:09:49,173][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:09:49,497][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:09:49,822][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:09:50,563][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:09:51,264][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:09:51,266][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:09:51,267][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:09:53,151][__main__][INFO] - Iteration 341 took 24s (38.16% Gen, 54.06% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 5m 17s. Estimated total time: 20h 11m 21s. Time estimates for 10 more iterations: 4m 2s, 100 more iterations: 40m 22s, 500 more iterations: 3h 21m 53s.
[2025-11-13 10:09:53,153][__main__][INFO] - Starting iteration 341.
[2025-11-13 10:09:53,156][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 34 and human policies 1.
[2025-11-13 10:09:53,156][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:10:02,728][__main__][INFO] - Number of regex retries in iteration 341: 0
[2025-11-13 10:10:02,728][__main__][INFO] - agents played in iteration 341 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:10:03,183][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:10:03,218][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:10:03,252][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:10:03,284][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:10:03,285][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:10:03,285][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:10:03,979][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:10:04,274][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:10:04,599][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:10:04,927][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:10:05,251][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:10:05,575][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:10:05,900][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:10:06,223][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:10:06,546][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:10:06,870][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:10:07,192][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:10:07,513][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:10:07,836][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:10:08,158][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:10:08,484][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:10:08,806][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:10:09,130][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:10:09,456][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:10:09,781][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:10:10,106][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:10:10,432][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:10:10,756][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:10:11,079][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:10:11,406][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:10:11,730][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:10:12,054][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:10:12,385][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:10:12,709][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:10:13,034][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:10:13,359][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:10:13,684][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:10:14,009][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:10:14,333][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:10:15,056][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:10:15,768][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:10:15,772][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:10:15,773][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:10:16,724][__main__][INFO] - Iteration 342 took 23s (40.61% Gen, 55.35% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 31m 59s. Estimated total time: 19h 38m 27s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 16s, 500 more iterations: 3h 16m 24s.
[2025-11-13 10:10:16,726][__main__][INFO] - Starting iteration 342.
[2025-11-13 10:10:16,730][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 34 and human policies 1.
[2025-11-13 10:10:16,730][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:10:26,093][__main__][INFO] - Number of regex retries in iteration 342: 0
[2025-11-13 10:10:26,094][__main__][INFO] - agents played in iteration 342 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:10:26,553][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:10:26,586][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:10:26,619][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:10:26,651][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:10:26,652][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:10:26,652][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:10:27,356][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:10:27,649][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:10:27,977][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:10:28,302][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:10:28,625][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:10:28,950][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:10:29,277][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:10:29,604][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:10:29,934][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:10:30,258][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:10:30,584][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:10:30,909][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:10:31,234][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:10:31,558][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:10:31,887][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:10:32,214][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:10:32,539][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:10:32,863][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:10:33,190][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:10:33,517][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:10:33,844][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:10:34,174][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:10:34,499][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:10:34,823][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:10:35,150][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:10:35,474][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:10:35,799][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:10:36,122][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:10:36,446][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:10:36,771][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:10:37,095][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:10:37,419][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:10:37,745][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:10:38,458][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:10:39,152][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:10:39,154][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:10:39,155][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:10:40,045][__main__][INFO] - Iteration 343 took 23s (40.16% Gen, 56.02% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 18m 57s. Estimated total time: 19h 25m 48s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 51s, 500 more iterations: 3h 14m 18s.
[2025-11-13 10:10:40,047][__main__][INFO] - Starting iteration 343.
[2025-11-13 10:10:40,049][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 34 and human policies 1.
[2025-11-13 10:10:40,050][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:10:49,395][__main__][INFO] - Number of regex retries in iteration 343: 0
[2025-11-13 10:10:49,395][__main__][INFO] - agents played in iteration 343 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:10:49,847][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:10:49,881][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:10:49,914][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:10:49,947][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:10:49,948][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:10:49,948][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:10:50,653][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:10:50,949][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:10:51,273][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:10:51,597][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:10:51,921][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:10:52,249][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:10:52,574][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:10:52,899][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:10:53,225][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:10:53,550][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:10:53,874][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:10:54,197][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:10:54,521][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:10:54,847][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:10:55,171][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:10:55,496][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:10:55,821][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:10:56,143][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:10:56,468][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:10:56,794][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:10:57,120][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:10:57,445][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:10:57,769][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:10:58,093][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:10:58,419][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:10:58,749][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:10:59,076][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:10:59,402][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:10:59,728][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:11:00,053][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:11:00,378][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:11:00,704][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:11:01,028][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:11:01,756][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:11:02,451][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:11:02,452][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:11:02,454][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:11:03,333][__main__][INFO] - Iteration 344 took 23s (40.13% Gen, 56.08% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 17m 0s. Estimated total time: 19h 24m 14s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 48s, 500 more iterations: 3h 14m 2s.
[2025-11-13 10:11:03,335][__main__][INFO] - Starting iteration 344.
[2025-11-13 10:11:03,338][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 34 and human policies 1.
[2025-11-13 10:11:03,339][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:11:13,120][__main__][INFO] - Number of regex retries in iteration 344: 0
[2025-11-13 10:11:13,120][__main__][INFO] - agents played in iteration 344 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:11:13,575][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:11:13,610][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:11:13,643][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:11:13,675][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:11:13,676][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:11:13,676][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:11:14,370][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:11:14,664][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:11:14,989][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:11:15,313][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:11:15,637][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:11:15,961][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:11:16,287][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:11:16,614][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:11:16,939][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:11:17,263][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:11:17,586][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:11:17,910][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:11:18,234][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:11:18,558][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:11:18,882][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:11:19,205][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:11:19,529][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:11:19,852][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:11:20,176][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:11:20,498][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:11:20,823][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:11:21,147][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:11:21,475][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:11:21,800][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:11:22,128][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:11:22,456][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:11:22,782][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:11:23,108][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:11:23,434][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:11:23,758][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:11:24,084][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:11:24,409][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:11:24,734][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:11:25,447][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:11:26,157][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:11:26,158][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:11:26,160][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:11:27,055][__main__][INFO] - Iteration 345 took 23s (41.24% Gen, 54.98% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 38m 15s. Estimated total time: 19h 45m 53s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 31s, 500 more iterations: 3h 17m 38s.
[2025-11-13 10:11:27,058][__main__][INFO] - Starting iteration 345.
[2025-11-13 10:11:27,061][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 34 and human policies 1.
[2025-11-13 10:11:27,061][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:11:36,774][__main__][INFO] - Number of regex retries in iteration 345: 0
[2025-11-13 10:11:36,775][__main__][INFO] - agents played in iteration 345 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:11:37,229][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:11:37,589][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:11:37,621][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:11:37,655][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:11:37,655][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:11:37,656][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:11:38,352][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:11:38,646][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:11:38,970][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:11:39,296][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:11:39,626][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:11:39,956][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:11:40,281][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:11:40,607][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:11:40,936][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:11:41,262][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:11:41,586][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:11:41,911][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:11:42,236][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:11:42,563][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:11:42,893][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:11:43,217][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:11:43,544][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:11:43,871][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:11:44,198][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:11:44,525][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:11:44,851][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:11:45,177][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:11:45,510][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:11:45,838][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:11:46,163][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:11:46,487][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:11:46,811][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:11:47,134][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:11:47,460][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:11:47,784][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:11:48,110][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:11:48,434][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:11:48,758][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:11:49,481][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:11:50,176][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:11:50,178][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:11:50,179][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:11:51,066][__main__][INFO] - Iteration 346 took 24s (40.46% Gen, 55.84% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 52m 15s. Estimated total time: 20h 0m 17s. Time estimates for 10 more iterations: 4m 0s, 100 more iterations: 40m 0s, 500 more iterations: 3h 20m 2s.
[2025-11-13 10:11:51,068][__main__][INFO] - Starting iteration 346.
[2025-11-13 10:11:51,071][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 34 and human policies 1.
[2025-11-13 10:11:51,071][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:11:59,304][__main__][INFO] - Number of regex retries in iteration 346: 0
[2025-11-13 10:11:59,305][__main__][INFO] - agents played in iteration 346 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:11:59,756][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:12:00,115][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:12:00,147][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:12:00,180][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:12:00,180][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:12:00,181][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:12:00,877][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:12:01,169][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:12:01,495][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:12:01,817][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:12:02,141][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:12:02,468][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:12:02,794][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:12:03,124][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:12:03,450][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:12:03,774][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:12:04,098][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:12:04,421][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:12:04,750][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:12:05,080][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:12:05,406][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:12:05,731][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:12:06,055][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:12:06,379][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:12:06,706][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:12:07,030][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:12:07,353][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:12:07,677][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:12:08,004][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:12:08,333][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:12:08,658][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:12:08,982][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:12:09,306][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:12:09,631][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:12:09,960][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:12:10,285][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:12:10,611][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:12:10,936][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:12:11,260][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:12:11,985][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:12:12,677][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:12:12,682][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:12:12,684][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:12:13,623][__main__][INFO] - Iteration 347 took 22s (36.51% Gen, 59.32% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 39m 15s. Estimated total time: 18h 47m 40s. Time estimates for 10 more iterations: 3m 45s, 100 more iterations: 37m 35s, 500 more iterations: 3h 7m 56s.
[2025-11-13 10:12:13,625][__main__][INFO] - Starting iteration 347.
[2025-11-13 10:12:13,628][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 34 and human policies 1.
[2025-11-13 10:12:13,629][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:12:23,789][__main__][INFO] - Number of regex retries in iteration 347: 0
[2025-11-13 10:12:23,790][__main__][INFO] - agents played in iteration 347 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:12:24,242][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:12:24,276][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:12:24,309][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:12:24,342][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:12:24,342][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:12:24,342][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
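The "Number of regex retries in iteration N: 0" counter suggests each agent reply is parsed with a regular expression and regenerated when nothing matches, with the retry count logged per iteration. A hypothetical sketch of such a parse-with-retry loop (the pattern, reply format, and function names are all invented for illustration; the actual parser is not shown in the log):

```python
import re

# Hypothetical pattern for an iterated-prisoner's-dilemma move.
ACTION_RE = re.compile(r"Action:\s*(COOPERATE|DEFECT)", re.IGNORECASE)

def parse_with_retries(generate, max_retries=3):
    """Call `generate()` until the reply matches ACTION_RE.

    Returns (action, retries_used); raises when the retry budget runs out.
    """
    for retries in range(max_retries + 1):
        match = ACTION_RE.search(generate())
        if match:
            return match.group(1).upper(), retries
    raise ValueError(f"no parsable action after {max_retries} retries")

# Simulated model: the first reply is malformed, the second parses.
replies = iter(["I think we should work together.", "Action: cooperate"])
action, retries = parse_with_retries(lambda: next(replies))
```

A zero count every iteration, as seen here, would mean the policies consistently emit well-formed actions on the first attempt.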
[2025-11-13 10:12:25,047][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:12:25,342][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:12:25,667][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:12:25,992][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:12:26,318][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:12:26,644][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:12:26,966][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:12:27,290][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:12:27,615][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:12:27,939][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:12:28,264][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:12:28,587][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:12:28,913][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:12:29,239][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:12:29,563][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:12:29,888][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:12:30,211][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:12:30,536][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:12:30,858][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:12:31,184][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:12:31,510][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:12:31,833][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:12:32,157][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:12:32,482][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:12:32,807][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:12:33,136][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:12:33,466][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:12:33,793][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:12:34,120][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:12:34,445][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:12:34,768][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:12:35,093][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:12:35,419][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:12:36,153][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:12:36,847][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:12:36,849][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:12:36,850][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:12:37,781][__main__][INFO] - Iteration 348 took 24s (42.07% Gen, 54.07% Train). Generation: 10s, Training: 13s. Estimated remaining time: 17h 58m 52s. Estimated total time: 20h 7m 41s. Time estimates for 10 more iterations: 4m 1s, 100 more iterations: 40m 15s, 500 more iterations: 3h 21m 16s.
[2025-11-13 10:12:37,783][__main__][INFO] - Starting iteration 348.
[2025-11-13 10:12:37,786][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 34 and human policies 1.
[2025-11-13 10:12:37,786][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:12:47,984][__main__][INFO] - Number of regex retries in iteration 348: 0
[2025-11-13 10:12:47,985][__main__][INFO] - agents played in iteration 348 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:12:48,448][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:12:48,482][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:12:48,515][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:12:48,548][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:12:48,549][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:12:48,550][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:12:49,270][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:12:49,565][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:12:49,889][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:12:50,213][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:12:50,536][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:12:50,860][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:12:51,186][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:12:51,511][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:12:51,837][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:12:52,162][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:12:52,486][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:12:52,811][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:12:53,134][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:12:53,461][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:12:53,784][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:12:54,110][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:12:54,434][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:12:54,760][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:12:55,086][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:12:55,416][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:12:55,741][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:12:56,072][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:12:56,400][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:12:56,724][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:12:57,049][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:12:57,374][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:12:57,699][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:12:58,025][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:12:58,349][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:12:58,674][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:12:58,999][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:12:59,323][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:12:59,648][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:13:00,371][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:13:01,073][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:13:01,074][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:13:01,076][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:13:02,005][__main__][INFO] - Iteration 349 took 24s (42.10% Gen, 54.05% Train). Generation: 10s, Training: 13s. Estimated remaining time: 18h 1m 48s. Estimated total time: 20h 11m 1s. Time estimates for 10 more iterations: 4m 2s, 100 more iterations: 40m 22s, 500 more iterations: 3h 21m 50s.
[2025-11-13 10:13:02,008][__main__][INFO] - Starting iteration 349.
[2025-11-13 10:13:02,010][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 34 and human policies 1.
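The per-iteration summaries ("Iteration 349 took 24s ... Estimated remaining time: 18h 1m 48s ... 10 more iterations: 4m 2s") look like a simple extrapolation: average the seconds per completed iteration, multiply by the iterations left, and render the result as h/m/s. A rough sketch of that estimate (the averaging window, total-iteration count, and names are guesses, not the script's actual logic):

```python
def fmt_hms(seconds: float) -> str:
    """Render a duration the way the log does, e.g. '18h 1m 48s'."""
    h, rem = divmod(int(seconds), 3600)
    m, s = divmod(rem, 60)
    return f"{h}h {m}m {s}s"

def eta(iter_seconds, done, total):
    """Estimate remaining wall time from the mean completed-iteration duration."""
    mean = sum(iter_seconds) / len(iter_seconds)
    return fmt_hms(mean * (total - done))

# e.g. recent iterations of 22-24 s, with an assumed total of 3000 iterations
estimate = eta([24, 22, 24, 24], done=348, total=3000)
```

Because each iteration's duration feeds straight into the mean, the estimate jitters a little from one summary line to the next, exactly as seen in the log (17h 52m, 16h 39m, 17h 58m, ...).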
[2025-11-13 10:13:02,010][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:13:12,265][__main__][INFO] - Number of regex retries in iteration 349: 0
[2025-11-13 10:13:12,265][__main__][INFO] - agents played in iteration 349 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:13:12,740][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:13:12,773][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:13:12,806][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:13:12,839][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:13:12,840][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:13:12,840][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:13:13,527][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:13:13,821][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:13:14,147][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:13:14,471][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:13:14,795][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:13:15,121][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:13:15,446][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:13:15,769][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:13:16,094][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:13:16,417][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:13:16,743][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:13:17,068][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:13:17,393][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:13:17,718][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:13:18,042][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:13:18,365][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:13:18,689][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:13:19,013][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:13:19,337][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:13:19,661][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:13:19,987][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:13:20,313][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:13:20,638][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:13:20,962][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:13:21,288][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:13:21,612][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:13:21,936][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:13:22,261][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:13:22,586][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:13:22,912][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:13:23,237][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:13:23,562][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:13:23,887][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:13:24,630][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:13:25,328][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:13:25,330][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:13:25,331][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:13:26,211][__main__][INFO] - Iteration 350 took 24s (42.37% Gen, 53.99% Train). Generation: 10s, Training: 13s. Estimated remaining time: 18h 0m 29s. Estimated total time: 20h 10m 6s. Time estimates for 10 more iterations: 4m 2s, 100 more iterations: 40m 20s, 500 more iterations: 3h 21m 41s.
[2025-11-13 10:13:26,214][__main__][INFO] - Starting iteration 350.
[2025-11-13 10:13:26,216][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 34 and human policies 1.
[2025-11-13 10:13:26,217][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:13:35,575][__main__][INFO] - Number of regex retries in iteration 350: 0
[2025-11-13 10:13:35,576][__main__][INFO] - agents played in iteration 350 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:13:36,061][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:13:36,095][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:13:36,128][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:13:36,162][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:13:36,162][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:13:36,162][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:13:36,872][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:13:37,167][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:13:37,490][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:13:37,813][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:13:38,136][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:13:38,461][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:13:38,785][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:13:39,108][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:13:39,434][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:13:39,759][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:13:40,084][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:13:40,408][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:13:40,733][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:13:41,058][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:13:41,382][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:13:41,706][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:13:42,030][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:13:42,355][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:13:42,680][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:13:43,004][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:13:43,331][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:13:43,658][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:13:43,983][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:13:44,310][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:13:44,638][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:13:44,964][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:13:45,291][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:13:45,615][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:13:45,939][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:13:46,264][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:13:46,588][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:13:46,913][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:13:47,239][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:13:47,981][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:13:48,669][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:13:48,671][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:13:48,672][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:13:50,783][__main__][INFO] - Iteration 351 took 24s (38.09% Gen, 53.31% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 18m 20s. Estimated total time: 20h 28m 22s. Time estimates for 10 more iterations: 4m 5s, 100 more iterations: 40m 56s, 500 more iterations: 3h 24m 43s.
[2025-11-13 10:13:50,785][__main__][INFO] - Starting iteration 351.
[2025-11-13 10:13:50,787][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 35 and human policies 1.
[2025-11-13 10:13:50,788][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:14:00,317][__main__][INFO] - Number of regex retries in iteration 351: 0
[2025-11-13 10:14:00,318][__main__][INFO] - agents played in iteration 351 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:14:00,790][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:14:00,826][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:14:00,860][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:14:00,893][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:14:00,894][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:14:00,894][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:14:01,577][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:14:01,872][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:14:02,196][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:14:02,524][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:14:02,854][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:14:03,180][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:14:03,506][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:14:03,831][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:14:04,153][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:14:04,476][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:14:04,800][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:14:05,125][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:14:05,451][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:14:05,781][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:14:06,109][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:14:06,435][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:14:06,762][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:14:07,087][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:14:07,412][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:14:07,738][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:14:08,068][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:14:08,395][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:14:08,721][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:14:09,047][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:14:09,373][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:14:09,697][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:14:10,024][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:14:10,348][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:14:10,673][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:14:10,997][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:14:11,322][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:14:11,647][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:14:11,971][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:14:12,711][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:14:13,410][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:14:13,411][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:14:13,413][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:14:14,454][__main__][INFO] - Iteration 352 took 23s (40.27% Gen, 55.33% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 32m 57s. Estimated total time: 19h 43m 23s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 26s, 500 more iterations: 3h 17m 13s.
[2025-11-13 10:14:14,456][__main__][INFO] - Starting iteration 352.
[2025-11-13 10:14:14,460][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 35 and human policies 1.
[2025-11-13 10:14:14,460][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:14:24,298][__main__][INFO] - Number of regex retries in iteration 352: 0
[2025-11-13 10:14:24,298][__main__][INFO] - agents played in iteration 352 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:14:24,763][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:14:24,796][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:14:24,830][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:14:24,864][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:14:24,864][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:14:24,865][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:14:25,575][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:14:25,869][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:14:26,192][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:14:26,516][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:14:26,838][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:14:27,161][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:14:27,485][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:14:27,809][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:14:28,134][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:14:28,458][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:14:28,783][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:14:29,109][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:14:29,435][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:14:29,762][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:14:30,088][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:14:30,415][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:14:30,741][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:14:31,069][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:14:31,396][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:14:31,720][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:14:32,049][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:14:32,373][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:14:32,700][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:14:33,026][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:14:33,350][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:14:33,674][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:14:34,000][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:14:34,325][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:14:34,651][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:14:34,977][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:14:35,302][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:14:35,627][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:14:35,952][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:14:36,676][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:14:37,378][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:14:37,380][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:14:37,382][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:14:38,322][__main__][INFO] - Iteration 353 took 23s (41.22% Gen, 54.82% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 42m 21s. Estimated total time: 19h 53m 10s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 46s, 500 more iterations: 3h 18m 51s.
[2025-11-13 10:14:38,325][__main__][INFO] - Starting iteration 353.
[2025-11-13 10:14:38,328][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 35 and human policies 1.
[2025-11-13 10:14:38,329][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:14:47,845][__main__][INFO] - Number of regex retries in iteration 353: 0
[2025-11-13 10:14:47,846][__main__][INFO] - agents played in iteration 353 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:14:48,303][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:14:48,336][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:14:48,369][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:14:48,403][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:14:48,404][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:14:48,405][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:14:49,113][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:14:49,407][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:14:49,733][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:14:50,062][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:14:50,389][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:14:50,718][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:14:51,043][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:14:51,368][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:14:51,693][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:14:52,022][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:14:52,351][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:14:52,676][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:14:53,000][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:14:53,325][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:14:53,653][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:14:53,980][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:14:54,306][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:14:54,637][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:14:54,965][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:14:55,289][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:14:55,615][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:14:55,943][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:14:56,267][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:14:56,592][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:14:56,917][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:14:57,241][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:14:57,567][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:14:57,892][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:14:58,217][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:14:58,541][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:14:58,867][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:14:59,190][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:14:59,515][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:15:00,242][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:15:00,943][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:15:00,945][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:15:00,946][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:15:01,948][__main__][INFO] - Iteration 354 took 23s (40.29% Gen, 55.46% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 29m 49s. Estimated total time: 19h 41m 2s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 22s, 500 more iterations: 3h 16m 50s.
[2025-11-13 10:15:01,950][__main__][INFO] - Starting iteration 354.
[2025-11-13 10:15:01,953][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 35 and human policies 1.
[2025-11-13 10:15:01,954][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:15:11,346][__main__][INFO] - Number of regex retries in iteration 354: 0
[2025-11-13 10:15:11,346][__main__][INFO] - agents played in iteration 354 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:15:11,819][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:15:11,852][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:15:11,885][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:15:11,919][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:15:11,920][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:15:11,921][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:15:12,633][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:15:12,930][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:15:13,255][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:15:13,578][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:15:13,900][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:15:14,225][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:15:14,550][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:15:14,874][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:15:15,199][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:15:15,523][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:15:15,849][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:15:16,175][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:15:16,500][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:15:16,827][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:15:17,156][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:15:17,482][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:15:17,811][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:15:18,135][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:15:18,461][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:15:18,787][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:15:19,112][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:15:19,436][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:15:19,760][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:15:20,086][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:15:20,411][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:15:20,736][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:15:21,061][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:15:21,387][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:15:21,714][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:15:22,038][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:15:22,363][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:15:22,688][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:15:23,012][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:15:23,733][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:15:24,437][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:15:24,438][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:15:24,440][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:15:25,407][__main__][INFO] - Iteration 355 took 23s (40.04% Gen, 55.83% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 21m 8s. Estimated total time: 19h 32m 45s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 5s, 500 more iterations: 3h 15m 27s.
[2025-11-13 10:15:25,409][__main__][INFO] - Starting iteration 355.
[2025-11-13 10:15:25,413][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 35 and human policies 1.
[2025-11-13 10:15:25,413][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:15:35,023][__main__][INFO] - Number of regex retries in iteration 355: 0
[2025-11-13 10:15:35,024][__main__][INFO] - agents played in iteration 355 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:15:35,520][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:15:35,553][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:15:35,586][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:15:35,620][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:15:35,620][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:15:35,621][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:15:36,342][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:15:36,636][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:15:36,961][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:15:37,283][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:15:37,607][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:15:37,930][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:15:38,253][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:15:38,577][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:15:38,901][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:15:39,225][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:15:39,549][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:15:39,876][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:15:40,200][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:15:40,527][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:15:40,850][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:15:41,176][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:15:41,507][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:15:41,832][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:15:42,162][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:15:42,489][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:15:42,814][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:15:43,137][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:15:43,461][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:15:43,786][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:15:44,112][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:15:44,436][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:15:44,761][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:15:45,086][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:15:45,411][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:15:45,734][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:15:46,060][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:15:46,383][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:15:46,709][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:15:47,433][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:15:48,153][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:15:48,155][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:15:48,156][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:15:49,080][__main__][INFO] - Iteration 356 took 23s (40.60% Gen, 55.49% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 31m 24s. Estimated total time: 19h 43m 24s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 26s, 500 more iterations: 3h 17m 14s.
[2025-11-13 10:15:49,082][__main__][INFO] - Starting iteration 356.
[2025-11-13 10:15:49,086][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 35 and human policies 1.
[2025-11-13 10:15:49,087][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:15:58,496][__main__][INFO] - Number of regex retries in iteration 356: 0
[2025-11-13 10:15:58,496][__main__][INFO] - agents played in iteration 356 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:15:58,969][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:15:59,002][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:15:59,034][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:15:59,067][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:15:59,068][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:15:59,068][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:15:59,804][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:16:00,098][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:16:00,423][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:16:00,749][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:16:01,075][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:16:01,403][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:16:01,729][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:16:02,053][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:16:02,383][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:16:02,708][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:16:03,033][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:16:03,358][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:16:03,689][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:16:04,014][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:16:04,343][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:16:04,671][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:16:05,001][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:16:05,331][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:16:05,663][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:16:05,988][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:16:06,311][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:16:06,638][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:16:06,964][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:16:07,290][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:16:07,614][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:16:07,941][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:16:08,266][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:16:08,590][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:16:08,916][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:16:09,239][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:16:09,563][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:16:09,888][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:16:10,213][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:16:10,940][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:16:11,660][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:16:11,662][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:16:11,664][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:16:12,621][__main__][INFO] - Iteration 357 took 23s (39.98% Gen, 55.94% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 24m 24s. Estimated total time: 19h 36m 48s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 13s, 500 more iterations: 3h 16m 8s.
[2025-11-13 10:16:12,623][__main__][INFO] - Starting iteration 357.
[2025-11-13 10:16:12,627][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 35 and human policies 1.
[2025-11-13 10:16:12,627][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:16:21,716][__main__][INFO] - Number of regex retries in iteration 357: 0
[2025-11-13 10:16:21,717][__main__][INFO] - agents played in iteration 357 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:16:22,165][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:16:22,542][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:16:22,575][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:16:22,609][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:16:22,609][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:16:22,610][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:16:23,333][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:16:23,629][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:16:23,954][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:16:24,277][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:16:24,600][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:16:24,925][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:16:25,248][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:16:25,571][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:16:25,894][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:16:26,217][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:16:26,540][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:16:26,865][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:16:27,189][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:16:27,512][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:16:27,842][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:16:28,168][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:16:28,497][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:16:28,824][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:16:29,147][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:16:29,473][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:16:29,797][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:16:30,123][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:16:30,447][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:16:30,772][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:16:31,096][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:16:31,421][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:16:31,746][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:16:32,070][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:16:32,395][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:16:32,720][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:16:33,045][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:16:33,368][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:16:33,692][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:16:34,402][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:16:35,124][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:16:35,125][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:16:35,127][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:16:36,095][__main__][INFO] - Iteration 358 took 23s (38.73% Gen, 57.14% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 20m 41s. Estimated total time: 19h 33m 28s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 6s, 500 more iterations: 3h 15m 34s.
[2025-11-13 10:16:36,098][__main__][INFO] - Starting iteration 358.
[2025-11-13 10:16:36,101][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 35 and human policies 1.
[2025-11-13 10:16:36,102][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:16:44,787][__main__][INFO] - Number of regex retries in iteration 358: 0
[2025-11-13 10:16:44,787][__main__][INFO] - agents played in iteration 358 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:16:45,237][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:16:45,270][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:16:45,303][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:16:45,335][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:16:45,336][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:16:45,336][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:16:46,060][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:16:46,354][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:16:46,682][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:16:47,009][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:16:47,333][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:16:47,662][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:16:47,985][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:16:48,310][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:16:48,637][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:16:48,962][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:16:49,287][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:16:49,613][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:16:49,938][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:16:50,263][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:16:50,588][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:16:50,914][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:16:51,239][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:16:51,565][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:16:51,891][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:16:52,217][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:16:52,542][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:16:52,868][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:16:53,193][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:16:53,519][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:16:53,844][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:16:54,167][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:16:54,493][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:16:54,819][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:16:55,142][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:16:55,467][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:16:55,793][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:16:56,117][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:16:56,443][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:16:57,160][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:16:57,881][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:16:57,882][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:16:57,884][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:16:58,875][__main__][INFO] - Iteration 359 took 22s (38.14% Gen, 57.51% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 45m 34s. Estimated total time: 18h 58m 44s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 57s, 500 more iterations: 3h 9m 47s.
[2025-11-13 10:16:58,877][__main__][INFO] - Starting iteration 359.
[2025-11-13 10:16:58,880][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 35 and human policies 1.
[2025-11-13 10:16:58,881][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:17:08,340][__main__][INFO] - Number of regex retries in iteration 359: 0
[2025-11-13 10:17:08,340][__main__][INFO] - agents played in iteration 359 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:17:08,801][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:17:08,836][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:17:08,869][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:17:08,902][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:17:08,902][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:17:08,903][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:17:09,635][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:17:09,931][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:17:10,256][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:17:10,579][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:17:10,906][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:17:11,229][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:17:11,552][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:17:11,877][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:17:12,200][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:17:12,529][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:17:12,851][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:17:13,177][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:17:13,503][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:17:13,827][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:17:14,151][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:17:14,479][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:17:14,806][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:17:15,133][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:17:15,460][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:17:15,787][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:17:16,111][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:17:16,436][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:17:16,761][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:17:17,086][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:17:17,411][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:17:17,736][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:17:18,060][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:17:18,385][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:17:18,710][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:17:19,036][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:17:19,360][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:17:19,684][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:17:20,009][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:17:20,732][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:17:21,485][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:17:21,486][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:17:21,492][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:17:22,659][__main__][INFO] - Iteration 360 took 23s (39.78% Gen, 55.31% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 35m 24s. Estimated total time: 19h 48m 57s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 37s, 500 more iterations: 3h 18m 9s.
[2025-11-13 10:17:22,661][__main__][INFO] - Starting iteration 360.
[2025-11-13 10:17:22,664][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 35 and human policies 1.
[2025-11-13 10:17:22,664][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:17:32,566][__main__][INFO] - Number of regex retries in iteration 360: 0
[2025-11-13 10:17:32,567][__main__][INFO] - agents played in iteration 360 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:17:33,037][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:17:33,083][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:17:33,119][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:17:33,153][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:17:33,153][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:17:33,154][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:17:33,875][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:17:34,170][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:17:34,494][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:17:34,818][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:17:35,143][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:17:35,465][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:17:35,789][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:17:36,115][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:17:36,446][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:17:36,772][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:17:37,097][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:17:37,422][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:17:37,748][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:17:38,075][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:17:38,398][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:17:38,724][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:17:39,053][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:17:39,383][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:17:39,708][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:17:40,033][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:17:40,357][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:17:40,682][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:17:41,007][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:17:41,333][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:17:41,657][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:17:41,983][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:17:42,308][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:17:42,632][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:17:42,957][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:17:43,282][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:17:43,608][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:17:43,933][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:17:44,257][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:17:44,977][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:17:45,703][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:17:45,705][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:17:45,707][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:17:47,613][__main__][INFO] - Iteration 361 took 24s (39.69% Gen, 52.66% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 33m 31s. Estimated total time: 20h 47m 29s. Time estimates for 10 more iterations: 4m 9s, 100 more iterations: 41m 34s, 500 more iterations: 3h 27m 54s.
[2025-11-13 10:17:47,615][__main__][INFO] - Starting iteration 361.
[2025-11-13 10:17:47,618][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 36 and human policies 1.
[2025-11-13 10:17:47,618][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:17:57,111][__main__][INFO] - Number of regex retries in iteration 361: 0
[2025-11-13 10:17:57,111][__main__][INFO] - agents played in iteration 361 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:17:57,573][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:17:57,609][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:17:57,642][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:17:57,675][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:17:57,676][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:17:57,676][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:17:58,383][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:17:58,678][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:17:59,004][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:17:59,327][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:17:59,653][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:17:59,981][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:18:00,308][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:18:00,633][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:18:00,961][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:18:01,288][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:18:01,615][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:18:01,945][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:18:02,271][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:18:02,597][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:18:02,925][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:18:03,252][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:18:03,577][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:18:03,905][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:18:04,231][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:18:04,554][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:18:04,878][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:18:05,203][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:18:05,527][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:18:05,852][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:18:06,175][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:18:06,499][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:18:06,824][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:18:07,147][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:18:07,472][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:18:07,795][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:18:08,120][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:18:08,445][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:18:08,770][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:18:09,504][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:18:10,244][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:18:10,245][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:18:10,247][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:18:11,541][__main__][INFO] - Iteration 362 took 23s (39.68% Gen, 54.91% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 41m 52s. Estimated total time: 19h 56m 14s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 52s, 500 more iterations: 3h 19m 22s.
[2025-11-13 10:18:11,544][__main__][INFO] - Starting iteration 362.
[2025-11-13 10:18:11,547][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 36 and human policies 1.
[2025-11-13 10:18:11,548][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:18:20,815][__main__][INFO] - Number of regex retries in iteration 362: 0
[2025-11-13 10:18:20,816][__main__][INFO] - agents played in iteration 362 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:18:21,279][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:18:21,313][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:18:21,347][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:18:21,381][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:18:21,382][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:18:21,382][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:18:22,104][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:18:22,398][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:18:22,726][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:18:23,049][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:18:23,371][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:18:23,696][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:18:24,018][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:18:24,341][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:18:24,667][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:18:24,992][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:18:25,318][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:18:25,642][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:18:25,966][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:18:26,290][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:18:26,617][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:18:26,945][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:18:27,269][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:18:27,594][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:18:27,919][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:18:28,245][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:18:28,570][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:18:28,894][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:18:29,218][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:18:29,542][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:18:29,867][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:18:30,191][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:18:30,515][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:18:30,839][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:18:31,164][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:18:31,489][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:18:31,814][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:18:32,140][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:18:32,465][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:18:33,175][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:18:33,901][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:18:33,903][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:18:33,905][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:18:34,891][__main__][INFO] - Iteration 363 took 23s (39.70% Gen, 56.07% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 12m 30s. Estimated total time: 19h 27m 16s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 54s, 500 more iterations: 3h 14m 32s. [2025-11-13 10:18:34,893][__main__][INFO] - Starting iteration 363. [2025-11-13 10:18:34,896][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 36 and human policies 1. 
[2025-11-13 10:18:34,897][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:18:43,979][__main__][INFO] - Number of regex retries in iteration 363: 0 [2025-11-13 10:18:43,979][__main__][INFO] - agents played in iteration 363 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 10:18:44,428][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:18:44,462][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:18:44,494][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:18:44,527][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:18:44,528][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:18:44,528][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:18:45,246][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:18:45,543][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:18:45,868][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:18:46,197][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:18:46,522][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:18:46,846][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:18:47,168][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:18:47,494][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:18:47,820][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:18:48,143][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:18:48,467][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:18:48,791][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:18:49,117][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:18:49,443][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:18:49,768][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:18:50,094][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:18:50,424][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:18:50,749][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:18:51,074][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:18:51,399][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:18:51,724][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:18:52,049][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:18:52,373][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:18:52,699][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:18:53,025][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:18:53,351][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:18:53,676][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:18:54,000][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:18:54,326][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:18:54,651][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:18:54,975][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:18:55,300][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:18:55,626][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:18:56,352][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:18:57,077][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:18:57,079][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:18:57,081][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:18:58,067][__main__][INFO] - Iteration 364 took 23s (39.20% Gen, 56.54% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 3m 25s. Estimated total time: 19h 18m 34s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 37s, 500 more iterations: 3h 13m 5s. [2025-11-13 10:18:58,069][__main__][INFO] - Starting iteration 364. [2025-11-13 10:18:58,072][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 36 and human policies 1. 
[2025-11-13 10:18:58,073][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:19:07,331][__main__][INFO] - Number of regex retries in iteration 364: 0 [2025-11-13 10:19:07,332][__main__][INFO] - agents played in iteration 364 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 10:19:07,783][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:19:07,821][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:19:07,854][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:19:07,886][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:19:07,887][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:19:07,887][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:19:08,604][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:19:08,900][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:19:09,225][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:19:09,550][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:19:09,874][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:19:10,196][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:19:10,522][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:19:10,845][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:19:11,170][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:19:11,495][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:19:11,820][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:19:12,145][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:19:12,469][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:19:12,793][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:19:13,117][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:19:13,446][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:19:13,771][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:19:14,096][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:19:14,420][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:19:14,747][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:19:15,072][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:19:15,396][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:19:15,721][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:19:16,047][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:19:16,371][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:19:16,697][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:19:17,022][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:19:17,347][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:19:17,672][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:19:17,997][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:19:18,322][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:19:18,646][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:19:18,971][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:19:19,688][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:19:20,397][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:19:20,399][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:19:20,401][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:19:21,363][__main__][INFO] - Iteration 365 took 23s (39.75% Gen, 56.11% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 9m 1s. Estimated total time: 19h 24m 34s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 49s, 500 more iterations: 3h 14m 5s. [2025-11-13 10:19:21,365][__main__][INFO] - Starting iteration 365. [2025-11-13 10:19:21,368][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 36 and human policies 1. 
[2025-11-13 10:19:21,369][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:19:30,348][__main__][INFO] - Number of regex retries in iteration 365: 0 [2025-11-13 10:19:30,349][__main__][INFO] - agents played in iteration 365 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 10:19:30,802][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:19:30,839][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:19:30,871][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:19:30,904][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:19:30,904][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:19:30,905][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:19:31,614][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:19:31,909][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:19:32,235][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:19:32,563][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:19:32,888][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:19:33,210][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:19:33,535][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:19:33,866][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:19:34,189][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:19:34,513][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:19:34,840][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:19:35,164][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:19:35,491][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:19:35,820][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:19:36,147][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:19:36,472][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:19:36,803][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:19:37,131][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:19:37,455][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:19:37,780][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:19:38,104][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:19:38,430][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:19:38,754][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:19:39,079][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:19:39,405][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:19:39,730][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:19:40,054][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:19:40,380][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:19:40,704][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:19:41,028][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:19:41,353][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:19:41,677][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:19:42,001][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:19:42,717][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:19:43,419][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:19:43,421][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:19:43,422][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:19:44,542][__main__][INFO] - Iteration 366 took 23s (38.75% Gen, 56.42% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 2m 47s. Estimated total time: 19h 18m 42s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 37s, 500 more iterations: 3h 13m 7s. [2025-11-13 10:19:44,544][__main__][INFO] - Starting iteration 366. [2025-11-13 10:19:44,547][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 36 and human policies 1. 
[2025-11-13 10:19:44,547][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:19:54,123][__main__][INFO] - Number of regex retries in iteration 366: 0 [2025-11-13 10:19:54,124][__main__][INFO] - agents played in iteration 366 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 10:19:54,577][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:19:54,611][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:19:54,644][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:19:54,677][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:19:54,678][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:19:54,678][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:19:55,411][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:19:55,707][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:19:56,031][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:19:56,354][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:19:56,676][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:19:57,004][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:19:57,328][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:19:57,657][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:19:57,981][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:19:58,307][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:19:58,630][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:19:58,954][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:19:59,279][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:19:59,605][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:19:59,932][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:20:00,258][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:20:00,582][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:20:00,906][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:20:01,232][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:20:01,558][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:20:01,883][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:20:02,208][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:20:02,533][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:20:02,858][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:20:03,180][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:20:03,506][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:20:03,832][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:20:04,156][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:20:04,480][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:20:04,804][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:20:05,128][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:20:05,452][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:20:05,776][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:20:06,513][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:20:07,241][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:20:07,243][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:20:07,245][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:20:08,267][__main__][INFO] - Iteration 367 took 23s (40.37% Gen, 55.31% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 29m 45s. Estimated total time: 19h 46m 4s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 32s, 500 more iterations: 3h 17m 40s. [2025-11-13 10:20:08,269][__main__][INFO] - Starting iteration 367. [2025-11-13 10:20:08,272][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 36 and human policies 1. 
[2025-11-13 10:20:08,273][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:20:17,903][__main__][INFO] - Number of regex retries in iteration 367: 0 [2025-11-13 10:20:17,903][__main__][INFO] - agents played in iteration 367 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 10:20:18,368][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:20:18,404][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:20:18,438][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:20:18,471][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:20:18,472][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:20:18,472][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:20:19,179][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:20:19,474][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:20:19,801][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:20:20,127][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:20:20,453][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:20:20,776][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:20:21,099][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:20:21,425][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:20:21,754][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:20:22,082][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:20:22,416][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:20:22,740][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:20:23,068][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:20:23,396][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:20:23,720][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:20:24,046][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:20:24,370][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:20:24,696][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:20:25,023][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:20:25,348][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:20:25,672][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:20:25,997][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:20:26,323][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:20:26,647][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:20:26,972][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:20:27,296][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:20:27,620][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:20:27,946][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:20:28,270][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:20:28,593][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:20:28,917][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:20:29,241][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:20:29,566][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:20:30,284][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:20:31,012][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:20:31,014][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:20:31,015][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:20:32,002][__main__][INFO] - Iteration 368 took 23s (40.58% Gen, 55.25% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 29m 49s. Estimated total time: 19h 46m 32s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 33s, 500 more iterations: 3h 17m 45s. [2025-11-13 10:20:32,004][__main__][INFO] - Starting iteration 368. [2025-11-13 10:20:32,007][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 36 and human policies 1. 
[2025-11-13 10:20:32,008][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:20:41,711][__main__][INFO] - Number of regex retries in iteration 368: 0
[2025-11-13 10:20:41,711][__main__][INFO] - agents played in iteration 368 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:20:42,168][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:20:42,202][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:20:42,235][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:20:42,268][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:20:42,269][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:20:42,269][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:20:42,980][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:20:43,274][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:20:43,602][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:20:43,926][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:20:44,249][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:20:44,574][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:20:44,898][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:20:45,223][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:20:45,550][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:20:45,877][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:20:46,206][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:20:46,531][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:20:46,855][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:20:47,180][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:20:47,506][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:20:47,831][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:20:48,155][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:20:48,479][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:20:48,804][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:20:49,129][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:20:49,454][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:20:49,780][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:20:50,106][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:20:50,432][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:20:50,756][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:20:51,081][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:20:51,406][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:20:51,732][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:20:52,057][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:20:52,381][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:20:52,707][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:20:53,035][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:20:53,358][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:20:54,066][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:20:54,793][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:20:54,795][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:20:54,796][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:20:55,820][__main__][INFO] - Iteration 369 took 23s (40.75% Gen, 54.95% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 33m 34s. Estimated total time: 19h 50m 41s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 41s, 500 more iterations: 3h 18m 26s.
[2025-11-13 10:20:55,822][__main__][INFO] - Starting iteration 369.
[2025-11-13 10:20:55,826][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 36 and human policies 1.
[2025-11-13 10:20:55,826][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:21:05,292][__main__][INFO] - Number of regex retries in iteration 369: 0
[2025-11-13 10:21:05,293][__main__][INFO] - agents played in iteration 369 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:21:05,769][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:21:05,802][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:21:05,834][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:21:05,868][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:21:05,869][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:21:05,869][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:21:06,590][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:21:06,886][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:21:07,216][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:21:07,542][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:21:07,869][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:21:08,195][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:21:08,524][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:21:08,849][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:21:09,178][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:21:09,505][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:21:09,830][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:21:10,153][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:21:10,478][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:21:10,807][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:21:11,131][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:21:11,456][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:21:11,781][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:21:12,107][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:21:12,431][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:21:12,757][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:21:13,080][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:21:13,404][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:21:13,729][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:21:14,052][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:21:14,378][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:21:14,704][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:21:15,030][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:21:15,355][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:21:15,679][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:21:16,002][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:21:16,328][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:21:16,657][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:21:16,981][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:21:17,681][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:21:18,416][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:21:18,417][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:21:18,419][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:21:19,391][__main__][INFO] - Iteration 370 took 23s (40.17% Gen, 55.70% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 20m 49s. Estimated total time: 19h 38m 19s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 16s, 500 more iterations: 3h 16m 23s.
[2025-11-13 10:21:19,393][__main__][INFO] - Starting iteration 370.
[2025-11-13 10:21:19,397][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 36 and human policies 1.
[2025-11-13 10:21:19,398][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:21:28,269][__main__][INFO] - Number of regex retries in iteration 370: 0
[2025-11-13 10:21:28,269][__main__][INFO] - agents played in iteration 370 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:21:28,718][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:21:28,754][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:21:28,786][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:21:28,819][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:21:28,820][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:21:28,821][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:21:29,535][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:21:29,830][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:21:30,155][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:21:30,481][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:21:30,812][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:21:31,139][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:21:31,463][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:21:31,792][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:21:32,121][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:21:32,447][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:21:32,771][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:21:33,096][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:21:33,423][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:21:33,749][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:21:34,074][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:21:34,399][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:21:34,723][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:21:35,049][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:21:35,375][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:21:35,699][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:21:36,025][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:21:36,351][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:21:36,676][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:21:37,000][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:21:37,325][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:21:37,651][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:21:37,977][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:21:38,303][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:21:38,627][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:21:38,952][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:21:39,276][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:21:39,602][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:21:39,927][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:21:40,628][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:21:41,362][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:21:41,364][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:21:41,366][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:21:43,363][__main__][INFO] - Iteration 371 took 23s (37.01% Gen, 54.65% Train). Generation: 8s, Training: 13s. Estimated remaining time: 17h 40m 27s. Estimated total time: 19h 58m 21s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 56s, 500 more iterations: 3h 19m 43s.
[2025-11-13 10:21:43,365][__main__][INFO] - Starting iteration 371.
[2025-11-13 10:21:43,368][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 37 and human policies 1.
[2025-11-13 10:21:43,369][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:21:52,747][__main__][INFO] - Number of regex retries in iteration 371: 0
[2025-11-13 10:21:52,748][__main__][INFO] - agents played in iteration 371 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:21:53,206][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:21:53,240][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:21:53,274][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:21:53,308][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:21:53,308][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:21:53,308][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:21:54,017][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:21:54,312][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:21:54,637][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:21:54,961][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:21:55,286][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:21:55,611][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:21:55,935][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:21:56,261][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:21:56,588][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:21:56,915][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:21:57,241][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:21:57,575][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:21:57,899][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:21:58,223][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:21:58,549][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:21:58,874][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:21:59,199][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:21:59,526][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:21:59,850][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:22:00,173][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:22:00,497][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:22:00,822][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:22:01,146][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:22:01,471][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:22:01,795][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:22:02,121][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:22:02,447][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:22:02,772][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:22:03,097][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:22:03,422][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:22:03,750][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:22:04,074][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:22:04,401][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:22:05,110][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:22:05,845][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:22:05,847][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:22:05,849][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:22:06,817][__main__][INFO] - Iteration 372 took 23s (40.00% Gen, 55.87% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 14m 10s. Estimated total time: 19h 32m 27s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 4s, 500 more iterations: 3h 15m 24s.
[2025-11-13 10:22:06,819][__main__][INFO] - Starting iteration 372.
[2025-11-13 10:22:06,822][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 37 and human policies 1.
[2025-11-13 10:22:06,823][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:22:16,248][__main__][INFO] - Number of regex retries in iteration 372: 0
[2025-11-13 10:22:16,249][__main__][INFO] - agents played in iteration 372 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:22:16,713][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:22:16,747][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:22:16,780][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:22:16,813][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:22:16,813][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:22:16,814][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:22:17,539][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:22:17,834][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:22:18,157][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:22:18,480][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:22:18,807][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:22:19,131][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:22:19,456][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:22:19,779][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:22:20,102][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:22:20,428][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:22:20,752][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:22:21,076][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:22:21,402][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:22:21,732][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:22:22,058][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:22:22,384][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:22:22,708][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:22:23,033][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:22:23,357][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:22:23,682][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:22:24,007][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:22:24,332][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:22:24,656][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:22:24,980][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:22:25,304][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:22:25,627][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:22:25,953][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:22:26,278][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:22:26,604][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:22:26,930][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:22:27,253][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:22:27,577][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:22:27,902][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:22:28,617][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:22:29,341][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:22:29,343][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:22:29,344][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:22:30,321][__main__][INFO] - Iteration 373 took 23s (40.11% Gen, 55.73% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 16m 17s. Estimated total time: 19h 34m 58s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 9s, 500 more iterations: 3h 15m 49s.
[2025-11-13 10:22:30,323][__main__][INFO] - Starting iteration 373.
[2025-11-13 10:22:30,327][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 37 and human policies 1.
[2025-11-13 10:22:30,327][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:22:39,778][__main__][INFO] - Number of regex retries in iteration 373: 0
[2025-11-13 10:22:39,779][__main__][INFO] - agents played in iteration 373 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:22:40,226][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:22:40,262][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:22:40,294][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:22:40,327][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:22:40,328][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:22:40,328][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:22:41,038][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:22:41,332][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:22:41,657][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:22:41,984][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:22:42,311][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:22:42,634][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:22:42,960][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:22:43,283][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:22:43,607][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:22:43,931][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:22:44,255][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:22:44,586][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:22:44,914][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:22:45,241][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:22:45,566][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:22:45,890][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:22:46,214][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:22:46,540][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:22:46,865][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:22:47,189][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:22:47,513][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:22:47,836][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:22:48,160][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:22:48,486][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:22:48,811][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:22:49,135][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:22:49,460][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:22:49,785][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:22:50,109][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:22:50,434][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:22:50,762][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:22:51,088][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:22:51,412][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:22:52,107][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:22:52,978][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:22:52,980][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:22:52,994][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:22:54,170][__main__][INFO] - Iteration 374 took 23s (39.64% Gen, 55.42% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 33m 6s. Estimated total time: 19h 52m 10s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 44s, 500 more iterations: 3h 18m 41s.
[2025-11-13 10:22:54,172][__main__][INFO] - Starting iteration 374.
[2025-11-13 10:22:54,175][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 37 and human policies 1.
[2025-11-13 10:22:54,175][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:23:03,814][__main__][INFO] - Number of regex retries in iteration 374: 0
[2025-11-13 10:23:03,814][__main__][INFO] - agents played in iteration 374 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:23:04,271][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:23:04,305][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:23:04,339][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:23:04,372][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:23:04,373][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:23:04,373][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:23:05,112][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:23:05,410][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:23:05,738][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:23:06,065][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:23:06,394][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:23:06,720][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:23:07,046][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:23:07,374][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:23:07,698][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:23:08,022][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:23:08,347][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:23:08,672][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:23:08,997][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:23:09,321][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:23:09,644][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:23:09,970][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:23:10,295][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:23:10,620][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:23:10,944][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:23:11,267][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:23:11,591][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:23:11,914][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:23:12,238][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:23:12,562][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:23:12,886][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:23:13,211][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:23:13,535][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:23:13,860][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:23:14,184][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:23:14,509][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:23:14,835][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:23:15,157][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:23:15,484][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:23:16,198][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:23:16,915][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:23:16,917][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:23:16,919][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:23:17,912][__main__][INFO] - Iteration 375 took 23s (40.61% Gen, 55.20% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 27m 25s. Estimated total time: 19h 46m 53s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 33s, 500 more iterations: 3h 17m 48s.
[2025-11-13 10:23:17,914][__main__][INFO] - Starting iteration 375.
[2025-11-13 10:23:17,918][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 37 and human policies 1.
[2025-11-13 10:23:17,919][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:23:27,240][__main__][INFO] - Number of regex retries in iteration 375: 0
[2025-11-13 10:23:27,241][__main__][INFO] - agents played in iteration 375 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:23:27,700][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:23:27,733][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:23:27,767][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:23:27,800][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:23:27,801][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:23:27,801][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:23:28,538][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:23:28,974][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:23:29,275][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:23:29,598][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:23:29,924][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:23:30,254][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:23:30,585][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:23:30,911][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:23:31,237][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:23:31,563][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:23:31,887][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:23:32,212][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:23:32,536][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:23:32,860][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:23:33,185][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:23:33,509][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:23:33,834][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:23:34,158][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:23:34,483][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:23:34,807][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:23:35,132][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:23:35,455][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:23:35,781][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:23:36,105][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:23:36,429][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:23:36,754][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:23:37,079][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:23:37,405][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:23:37,729][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:23:38,053][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:23:38,381][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:23:38,706][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:23:39,034][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:23:39,755][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:23:40,495][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:23:40,496][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:23:40,498][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:23:41,469][__main__][INFO] - Iteration 376 took 23s (39.58% Gen, 56.29% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 17m 42s. Estimated total time: 19h 37m 34s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 15s, 500 more iterations: 3h 16m 15s.
[2025-11-13 10:23:41,471][__main__][INFO] - Starting iteration 376.
[2025-11-13 10:23:41,474][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 37 and human policies 1.
[2025-11-13 10:23:41,474][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:23:50,862][__main__][INFO] - Number of regex retries in iteration 376: 0
[2025-11-13 10:23:50,862][__main__][INFO] - agents played in iteration 376 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:23:51,327][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:23:51,361][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:23:51,395][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:23:51,428][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:23:51,429][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:23:51,429][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:23:52,162][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:23:52,457][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:23:52,783][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:23:53,108][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:23:53,437][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:23:53,766][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:23:54,091][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:23:54,415][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:23:54,740][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:23:55,063][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:23:55,387][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:23:55,712][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:23:56,036][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:23:56,360][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:23:56,684][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:23:57,011][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:23:57,334][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:23:57,659][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:23:57,983][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:23:58,306][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:23:58,630][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:23:58,954][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:23:59,278][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:23:59,603][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:23:59,928][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:24:00,253][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:24:00,577][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:24:00,900][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:24:01,226][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:24:01,550][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:24:01,876][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:24:02,200][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:24:02,522][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:24:03,233][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:24:03,967][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:24:03,968][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:24:03,970][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:24:04,973][__main__][INFO] - Iteration 377 took 23s (39.95% Gen, 55.78% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 14m 42s. Estimated total time: 19h 34m 58s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 9s, 500 more iterations: 3h 15m 49s.
[2025-11-13 10:24:04,975][__main__][INFO] - Starting iteration 377.
[2025-11-13 10:24:04,978][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 37 and human policies 1.
[2025-11-13 10:24:04,979][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:24:14,457][__main__][INFO] - Number of regex retries in iteration 377: 0
[2025-11-13 10:24:14,458][__main__][INFO] - agents played in iteration 377 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:24:14,929][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:24:14,962][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:24:14,995][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:24:15,029][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:24:15,029][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:24:15,029][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:24:15,766][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:24:16,060][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:24:16,385][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:24:16,714][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:24:17,043][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:24:17,368][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:24:17,697][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:24:18,025][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:24:18,351][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:24:18,675][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:24:19,002][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:24:19,328][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:24:19,652][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:24:19,976][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:24:20,302][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:24:20,624][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:24:20,948][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:24:21,275][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:24:21,599][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:24:21,924][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:24:22,248][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:24:22,572][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:24:22,896][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:24:23,222][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:24:23,547][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:24:23,871][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:24:24,196][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:24:24,522][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:24:24,845][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:24:25,173][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:24:25,501][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:24:25,826][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:24:26,153][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:24:26,885][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:24:27,621][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:24:27,623][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:24:27,624][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:24:28,574][__main__][INFO] - Iteration 378 took 23s (40.17% Gen, 55.80% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 19m 10s. Estimated total time: 19h 39m 49s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 19s, 500 more iterations: 3h 16m 38s.
[2025-11-13 10:24:28,575][__main__][INFO] - Starting iteration 378.
[2025-11-13 10:24:28,579][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 37 and human policies 1.
[2025-11-13 10:24:28,579][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:24:37,214][__main__][INFO] - Number of regex retries in iteration 378: 0
[2025-11-13 10:24:37,214][__main__][INFO] - agents played in iteration 378 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:24:37,679][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:24:37,712][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:24:37,745][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:24:37,778][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:24:37,779][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:24:37,779][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:24:38,513][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:24:38,809][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:24:39,133][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:24:39,456][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:24:39,779][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:24:40,104][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:24:40,432][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:24:40,756][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:24:41,082][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:24:41,409][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:24:41,734][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:24:42,059][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:24:42,383][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:24:42,709][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:24:43,033][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:24:43,358][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:24:43,682][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:24:44,006][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:24:44,329][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:24:44,653][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:24:44,978][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:24:45,303][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:24:45,627][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:24:45,952][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:24:46,276][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:24:46,600][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:24:46,925][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:24:47,249][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:24:47,575][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:24:47,899][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:24:48,224][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:24:48,548][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:24:48,872][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:24:49,587][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:24:50,303][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:24:50,305][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:24:50,306][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:24:51,418][__main__][INFO] - Iteration 379 took 22s (37.81% Gen, 57.32% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 40m 57s. Estimated total time: 19h 1m 59s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 3s, 500 more iterations: 3h 10m 19s.
[2025-11-13 10:24:51,420][__main__][INFO] - Starting iteration 379.
[2025-11-13 10:24:51,423][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 37 and human policies 1.
[2025-11-13 10:24:51,423][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:25:00,918][__main__][INFO] - Number of regex retries in iteration 379: 0
[2025-11-13 10:25:00,919][__main__][INFO] - agents played in iteration 379 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:25:01,375][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:25:01,410][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:25:01,443][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:25:01,477][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:25:01,478][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:25:01,478][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:25:02,209][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:25:02,505][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:25:02,831][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:25:03,158][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:25:03,483][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:25:03,808][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:25:04,135][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:25:04,460][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:25:04,786][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:25:05,111][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:25:05,436][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:25:05,761][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:25:06,086][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:25:06,411][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:25:06,736][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:25:07,060][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:25:07,385][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:25:07,710][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:25:08,036][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:25:08,358][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:25:08,683][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:25:09,009][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:25:09,333][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:25:09,657][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:25:09,982][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:25:10,307][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:25:10,634][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:25:10,957][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:25:11,283][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:25:11,608][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:25:11,935][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:25:12,264][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:25:12,589][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:25:13,321][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:25:14,055][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:25:14,056][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:25:14,058][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:25:15,224][__main__][INFO] - Iteration 380 took 23s (39.89% Gen, 55.20% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 28m 42s. Estimated total time: 19h 50m 8s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 40s, 500 more iterations: 3h 18m 21s.
[2025-11-13 10:25:15,227][__main__][INFO] - Starting iteration 380.
[2025-11-13 10:25:15,230][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 37 and human policies 1.
[2025-11-13 10:25:15,230][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:25:24,340][__main__][INFO] - Number of regex retries in iteration 380: 0
[2025-11-13 10:25:24,341][__main__][INFO] - agents played in iteration 380 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:25:24,817][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:25:24,854][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:25:24,887][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:25:24,921][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:25:24,922][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:25:24,922][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:25:25,617][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:25:25,911][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:25:26,236][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:25:26,561][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:25:26,886][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:25:27,211][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:25:27,535][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:25:27,860][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:25:28,184][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:25:28,510][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:25:28,837][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:25:29,160][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:25:29,485][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:25:29,809][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:25:30,133][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:25:30,457][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:25:30,781][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:25:31,106][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:25:31,431][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:25:31,756][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:25:32,081][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:25:32,407][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:25:32,730][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:25:33,055][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:25:33,379][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:25:33,705][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:25:34,030][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:25:34,354][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:25:34,680][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:25:35,004][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:25:35,334][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:25:35,661][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:25:35,988][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:25:36,811][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:25:37,560][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:25:37,561][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:25:37,563][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:25:39,454][__main__][INFO] - Iteration 381 took 24s (37.60% Gen, 54.58% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 49m 25s. Estimated total time: 20h 11m 15s. Time estimates for 10 more iterations: 4m 2s, 100 more iterations: 40m 22s, 500 more iterations: 3h 21m 52s.
[2025-11-13 10:25:39,457][__main__][INFO] - Starting iteration 381.
[2025-11-13 10:25:39,460][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 38 and human policies 1.
[2025-11-13 10:25:39,460][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:25:48,802][__main__][INFO] - Number of regex retries in iteration 381: 0
[2025-11-13 10:25:48,802][__main__][INFO] - agents played in iteration 381 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:25:49,266][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:25:49,299][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:25:49,332][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:25:49,365][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:25:49,365][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:25:49,365][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:25:50,068][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:25:50,363][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:25:50,687][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:25:51,010][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:25:51,335][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:25:51,661][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:25:51,984][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:25:52,308][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:25:52,635][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:25:52,961][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:25:53,286][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:25:53,609][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:25:53,932][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:25:54,257][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:25:54,580][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:25:54,906][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:25:55,231][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:25:55,556][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:25:55,880][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:25:56,206][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:25:56,530][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:25:56,854][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:25:57,178][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:25:57,502][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:25:57,825][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:25:58,148][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:25:58,473][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:25:58,798][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:25:59,123][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:25:59,451][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:25:59,777][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:26:00,104][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:26:00,433][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:26:01,175][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:26:01,907][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:26:01,908][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:26:01,910][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:26:02,950][__main__][INFO] - Iteration 382 took 23s (39.77% Gen, 55.80% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 12m 20s. Estimated total time: 19h 34m 34s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 9s, 500 more iterations: 3h 15m 45s.
[2025-11-13 10:26:02,952][__main__][INFO] - Starting iteration 382.
[2025-11-13 10:26:02,981][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 38 and human policies 1.
[2025-11-13 10:26:02,981][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:26:12,383][__main__][INFO] - Number of regex retries in iteration 382: 0
[2025-11-13 10:26:12,384][__main__][INFO] - agents played in iteration 382 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:26:12,861][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:26:12,894][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:26:12,927][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:26:12,960][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:26:12,961][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:26:12,961][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:26:13,679][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:26:13,974][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:26:14,304][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:26:14,631][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:26:14,966][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:26:15,293][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:26:15,616][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:26:15,945][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:26:16,272][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:26:16,598][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:26:16,927][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:26:17,252][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:26:17,576][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:26:17,901][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:26:18,226][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:26:18,549][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:26:18,875][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:26:19,200][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:26:19,523][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:26:19,848][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:26:20,174][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:26:20,499][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:26:20,823][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:26:21,148][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:26:21,472][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:26:21,797][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:26:22,120][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:26:22,446][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:26:22,770][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:26:23,093][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:26:23,417][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:26:23,743][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:26:24,074][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:26:24,794][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:26:25,510][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:26:25,511][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:26:25,513][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:26:26,531][__main__][INFO] - Iteration 383 took 23s (39.88% Gen, 55.69% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 16m 13s. Estimated total time: 19h 38m 50s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 17s, 500 more iterations: 3h 16m 28s.
[2025-11-13 10:26:26,533][__main__][INFO] - Starting iteration 383.
[2025-11-13 10:26:26,537][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 38 and human policies 1.
[2025-11-13 10:26:26,538][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:26:36,326][__main__][INFO] - Number of regex retries in iteration 383: 0
[2025-11-13 10:26:36,327][__main__][INFO] - agents played in iteration 383 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:26:36,802][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:26:36,835][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:26:36,868][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:26:36,901][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:26:36,901][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:26:36,902][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:26:37,618][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:26:37,912][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:26:38,238][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:26:38,566][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:26:38,890][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:26:39,213][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:26:39,540][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:26:39,867][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:26:40,197][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:26:40,522][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:26:40,847][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:26:41,172][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:26:41,496][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:26:41,821][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:26:42,146][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:26:42,470][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:26:42,795][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:26:43,119][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:26:43,444][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:26:43,768][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:26:44,092][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:26:44,415][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:26:44,739][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:26:45,063][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:26:45,388][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:26:45,715][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:26:46,041][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:26:46,365][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:26:46,691][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:26:47,018][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:26:47,345][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:26:47,668][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:26:47,992][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:26:48,719][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:26:49,466][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:26:49,468][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:26:49,469][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:26:50,469][__main__][INFO] - Iteration 384 took 23s (40.90% Gen, 54.92% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 33m 35s. Estimated total time: 19h 56m 36s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 53s, 500 more iterations: 3h 19m 26s.
[2025-11-13 10:26:50,471][__main__][INFO] - Starting iteration 384.
[2025-11-13 10:26:50,475][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 38 and human policies 1.
[2025-11-13 10:26:50,475][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:26:59,165][__main__][INFO] - Number of regex retries in iteration 384: 0
[2025-11-13 10:26:59,166][__main__][INFO] - agents played in iteration 384 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:26:59,619][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:26:59,655][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:26:59,688][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:26:59,721][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:26:59,721][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:26:59,721][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:27:00,778][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:27:01,072][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:27:01,396][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:27:01,720][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:27:02,050][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:27:02,377][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:27:02,701][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:27:03,025][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:27:03,350][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:27:03,675][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:27:03,999][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:27:04,324][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:27:04,649][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:27:04,973][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:27:05,299][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:27:05,622][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:27:05,946][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:27:06,275][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:27:06,600][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:27:06,926][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:27:07,249][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:27:07,573][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:27:07,898][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:27:08,222][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:27:08,547][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:27:08,870][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:27:09,196][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:27:09,521][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:27:09,845][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:27:10,169][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:27:10,493][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:27:10,817][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:27:11,144][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:27:11,856][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:27:12,595][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:27:12,597][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:27:12,598][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:27:13,698][__main__][INFO] - Iteration 385 took 23s (37.42% Gen, 57.84% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 57m 49s. Estimated total time: 19h 21m 13s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 42s, 500 more iterations: 3h 13m 32s. [2025-11-13 10:27:13,701][__main__][INFO] - Starting iteration 385. [2025-11-13 10:27:13,704][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 38 and human policies 1. 
[2025-11-13 10:27:13,704][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:27:22,570][__main__][INFO] - Number of regex retries in iteration 385: 0 [2025-11-13 10:27:22,571][__main__][INFO] - agents played in iteration 385 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 10:27:23,031][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:27:23,066][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:27:23,098][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:27:23,131][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:27:23,131][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:27:23,132][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:27:23,840][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:27:24,135][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:27:24,460][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:27:24,786][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:27:25,111][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:27:25,438][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:27:25,762][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:27:26,086][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:27:26,411][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:27:26,737][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:27:27,061][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:27:27,387][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:27:27,713][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:27:28,037][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:27:28,362][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:27:28,686][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:27:29,012][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:27:29,335][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:27:29,661][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:27:29,986][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:27:30,311][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 
[2025-11-13 10:27:30,635][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:27:30,959][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:27:31,282][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:27:31,608][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:27:31,933][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:27:32,257][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:27:32,582][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:27:32,907][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:27:33,231][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:27:33,554][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:27:33,878][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:27:34,204][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:27:34,903][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:27:35,627][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:27:35,628][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:27:35,630][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:27:36,677][__main__][INFO] - Iteration 386 took 22s (38.59% Gen, 56.84% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 44m 56s. Estimated total time: 19h 8m 43s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 17s, 500 more iterations: 3h 11m 27s. [2025-11-13 10:27:36,680][__main__][INFO] - Starting iteration 386. [2025-11-13 10:27:36,683][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 38 and human policies 1. 
[2025-11-13 10:27:36,684][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:27:45,824][__main__][INFO] - Number of regex retries in iteration 386: 0 [2025-11-13 10:27:45,824][__main__][INFO] - agents played in iteration 386 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 10:27:46,291][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:27:46,324][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:27:46,357][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:27:46,390][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:27:46,391][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:27:46,391][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:27:47,112][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:27:47,406][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:27:47,734][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:27:48,063][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:27:48,387][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:27:48,714][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:27:49,040][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:27:49,366][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:27:49,697][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:27:50,020][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:27:50,343][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:27:50,670][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:27:50,994][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:27:51,319][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:27:51,644][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:27:51,970][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:27:52,295][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:27:52,619][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:27:52,944][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:27:53,269][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:27:53,593][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 
[2025-11-13 10:27:53,919][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:27:54,244][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:27:54,569][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:27:54,895][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:27:55,219][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:27:55,544][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:27:55,868][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:27:56,194][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:27:56,520][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:27:56,844][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:27:57,168][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:27:57,496][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:27:58,201][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:27:58,929][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:27:58,930][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:27:58,932][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:27:59,907][__main__][INFO] - Iteration 387 took 23s (39.36% Gen, 56.44% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 57m 5s. Estimated total time: 19h 21m 15s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 42s, 500 more iterations: 3h 13m 32s. [2025-11-13 10:27:59,909][__main__][INFO] - Starting iteration 387. [2025-11-13 10:27:59,913][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 38 and human policies 1. 
[2025-11-13 10:27:59,914][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:28:08,688][__main__][INFO] - Number of regex retries in iteration 387: 0 [2025-11-13 10:28:08,689][__main__][INFO] - agents played in iteration 387 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 10:28:09,142][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:28:09,179][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:28:09,211][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:28:09,244][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:28:09,245][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:28:09,245][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:28:09,955][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:28:10,252][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:28:10,577][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:28:10,901][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:28:11,225][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:28:11,556][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:28:11,878][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:28:12,202][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:28:12,526][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:28:12,853][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:28:13,179][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:28:13,504][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:28:13,829][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:28:14,155][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:28:14,481][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:28:14,806][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:28:15,131][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:28:15,455][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:28:15,783][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:28:16,108][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:28:16,433][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 
[2025-11-13 10:28:16,758][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:28:17,082][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:28:17,407][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:28:17,732][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:28:18,056][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:28:18,381][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:28:18,707][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:28:19,032][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:28:19,356][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:28:19,682][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:28:20,008][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:28:20,332][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:28:21,036][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:28:21,766][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:28:21,767][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:28:21,769][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:28:22,743][__main__][INFO] - Iteration 388 took 22s (38.43% Gen, 57.29% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 36m 59s. Estimated total time: 19h 1m 33s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 3s, 500 more iterations: 3h 10m 15s. [2025-11-13 10:28:22,745][__main__][INFO] - Starting iteration 388. [2025-11-13 10:28:22,749][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 38 and human policies 1. 
[2025-11-13 10:28:22,749][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:28:31,988][__main__][INFO] - Number of regex retries in iteration 388: 0 [2025-11-13 10:28:31,988][__main__][INFO] - agents played in iteration 388 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 10:28:32,441][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:28:32,474][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:28:32,507][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:28:32,539][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:28:32,540][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:28:32,540][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:28:33,250][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:28:33,544][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:28:33,872][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:28:34,198][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:28:34,521][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:28:34,845][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:28:35,170][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:28:35,493][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:28:35,816][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:28:36,144][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:28:36,468][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:28:36,792][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:28:37,118][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:28:37,445][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:28:37,770][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:28:38,094][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:28:38,419][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:28:38,746][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:28:39,070][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:28:39,394][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:28:39,719][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 
[2025-11-13 10:28:40,043][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:28:40,368][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:28:40,693][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:28:41,019][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:28:41,344][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:28:41,668][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:28:41,993][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:28:42,318][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:28:42,642][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:28:42,967][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:28:43,292][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:28:43,617][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:28:44,317][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:28:45,051][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:28:45,054][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:28:45,055][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:28:46,034][__main__][INFO] - Iteration 389 took 23s (39.67% Gen, 56.12% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 59m 19s. Estimated total time: 19h 24m 16s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 48s, 500 more iterations: 3h 14m 2s. [2025-11-13 10:28:46,036][__main__][INFO] - Starting iteration 389. [2025-11-13 10:28:46,039][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 38 and human policies 1. 
[2025-11-13 10:28:46,040][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:28:55,147][__main__][INFO] - Number of regex retries in iteration 389: 0 [2025-11-13 10:28:55,147][__main__][INFO] - agents played in iteration 389 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 10:28:55,597][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:28:55,632][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:28:55,664][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:28:55,697][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:28:55,697][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:28:55,698][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:28:56,425][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:28:56,721][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:28:57,046][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:28:57,375][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:28:57,699][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:28:58,023][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:28:58,346][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:28:58,671][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:28:58,996][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:28:59,321][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:28:59,646][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:28:59,976][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:29:00,306][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:29:00,638][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:29:00,965][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:29:01,289][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:29:01,615][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:29:01,940][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:29:02,264][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:29:02,588][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:29:02,911][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 
[2025-11-13 10:29:03,235][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:29:03,558][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:29:03,883][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:29:04,207][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:29:04,531][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:29:04,856][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:29:05,182][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:29:05,507][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:29:05,831][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:29:06,155][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:29:06,479][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:29:06,809][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:29:07,515][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:29:08,252][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:29:08,256][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:29:08,257][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:29:09,220][__main__][INFO] - Iteration 390 took 23s (39.29% Gen, 56.55% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 53m 44s. Estimated total time: 19h 19m 4s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 38s, 500 more iterations: 3h 13m 10s. [2025-11-13 10:29:09,222][__main__][INFO] - Starting iteration 390. [2025-11-13 10:29:09,225][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 38 and human policies 1. 
[2025-11-13 10:29:09,225][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:29:18,958][__main__][INFO] - Number of regex retries in iteration 390: 0 [2025-11-13 10:29:18,959][__main__][INFO] - agents played in iteration 390 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 10:29:19,430][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:29:19,463][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:29:19,495][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:29:19,528][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:29:19,528][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:29:19,529][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:29:20,250][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:29:20,545][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:29:20,872][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:29:21,196][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:29:21,522][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:29:21,849][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:29:22,174][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:29:22,498][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:29:22,824][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:29:23,149][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:29:23,473][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:29:23,804][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:29:24,130][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:29:24,453][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:29:24,780][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:29:25,108][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:29:25,432][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:29:25,757][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:29:26,081][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:29:26,407][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:29:26,732][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:29:27,056][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:29:27,382][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:29:27,708][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:29:28,031][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:29:28,355][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:29:28,680][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:29:29,006][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:29:29,330][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:29:29,655][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:29:29,979][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:29:30,305][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:29:30,628][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:29:31,358][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:29:32,116][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:29:32,118][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:29:32,120][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:29:34,119][__main__][INFO] - Iteration 391 took 24s (39.10% Gen, 52.87% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 18m 59s. Estimated total time: 20h 44m 44s. Time estimates for 10 more iterations: 4m 8s, 100 more iterations: 41m 29s, 500 more iterations: 3h 27m 27s. [2025-11-13 10:29:34,121][__main__][INFO] - Starting iteration 391. [2025-11-13 10:29:34,123][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 39 and human policies 1. 
[2025-11-13 10:29:34,124][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:29:43,661][__main__][INFO] - Number of regex retries in iteration 391: 0 [2025-11-13 10:29:43,661][__main__][INFO] - agents played in iteration 391 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 10:29:44,136][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:29:44,170][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:29:44,203][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:29:44,237][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:29:44,237][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:29:44,238][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:29:44,920][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:29:45,213][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:29:45,537][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:29:45,860][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:29:46,183][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:29:46,508][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:29:46,833][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:29:47,157][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:29:47,482][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:29:47,808][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:29:48,133][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:29:48,460][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:29:48,784][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:29:49,110][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:29:49,441][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:29:49,765][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:29:50,090][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:29:50,415][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:29:50,739][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:29:51,065][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:29:51,388][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:29:51,713][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:29:52,039][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:29:52,364][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:29:52,688][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:29:53,013][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:29:53,338][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:29:53,664][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:29:53,990][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:29:54,315][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:29:54,639][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:29:54,965][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:29:55,290][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:29:56,018][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:29:56,759][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:29:56,792][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:29:56,794][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:29:57,949][__main__][INFO] - Iteration 392 took 23s (40.03% Gen, 55.12% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 25m 10s. Estimated total time: 19h 51m 19s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 42s, 500 more iterations: 3h 18m 33s. [2025-11-13 10:29:57,951][__main__][INFO] - Starting iteration 392. [2025-11-13 10:29:57,954][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 39 and human policies 1. 
[2025-11-13 10:29:57,955][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:30:07,727][__main__][INFO] - Number of regex retries in iteration 392: 0 [2025-11-13 10:30:07,728][__main__][INFO] - agents played in iteration 392 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 10:30:08,192][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:30:08,229][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:30:08,262][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:30:08,296][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:30:08,296][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:30:08,296][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:30:09,012][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:30:09,306][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:30:09,629][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:30:09,953][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:30:10,278][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:30:10,602][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:30:10,926][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:30:11,256][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:30:11,585][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:30:11,912][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:30:12,238][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:30:12,561][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:30:12,890][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:30:13,216][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:30:13,541][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:30:13,866][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:30:14,189][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:30:14,513][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:30:14,838][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:30:15,162][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:30:15,486][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:30:15,811][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:30:16,135][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:30:16,459][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:30:16,782][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:30:17,106][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:30:17,431][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:30:17,756][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:30:18,081][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:30:18,406][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:30:18,732][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:30:19,059][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:30:19,382][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:30:20,110][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:30:20,862][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:30:20,864][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:30:20,865][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:30:21,874][__main__][INFO] - Iteration 393 took 23s (40.86% Gen, 54.92% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 29m 27s. Estimated total time: 19h 55m 59s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 51s, 500 more iterations: 3h 19m 19s. [2025-11-13 10:30:21,876][__main__][INFO] - Starting iteration 393. [2025-11-13 10:30:21,879][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 39 and human policies 1. 
[2025-11-13 10:30:21,880][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:30:30,747][__main__][INFO] - Number of regex retries in iteration 393: 0 [2025-11-13 10:30:30,747][__main__][INFO] - agents played in iteration 393 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 10:30:31,210][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:30:31,247][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:30:31,281][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:30:31,314][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:30:31,314][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:30:31,314][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:30:32,039][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:30:32,333][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:30:32,657][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:30:32,980][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:30:33,303][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:30:33,628][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:30:33,953][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:30:34,276][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:30:34,602][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:30:34,927][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:30:35,252][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:30:35,582][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:30:35,912][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:30:36,239][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:30:36,565][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:30:36,892][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:30:37,219][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:30:37,545][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:30:37,869][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:30:38,193][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:30:38,519][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:30:38,843][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:30:39,168][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:30:39,493][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:30:39,817][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:30:40,142][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:30:40,468][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:30:40,791][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:30:41,116][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:30:41,440][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:30:41,764][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:30:42,091][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:30:42,418][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:30:43,145][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:30:43,945][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:30:43,947][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:30:43,948][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:30:45,024][__main__][INFO] - Iteration 394 took 23s (38.31% Gen, 57.03% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 50m 21s. Estimated total time: 19h 17m 17s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 34s, 500 more iterations: 3h 12m 52s. [2025-11-13 10:30:45,026][__main__][INFO] - Starting iteration 394. [2025-11-13 10:30:45,029][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 39 and human policies 1. 
[2025-11-13 10:30:45,030][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:30:54,525][__main__][INFO] - Number of regex retries in iteration 394: 0 [2025-11-13 10:30:54,526][__main__][INFO] - agents played in iteration 394 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 10:30:54,988][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:30:55,022][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:30:55,055][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:30:55,089][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:30:55,090][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:30:55,090][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:30:55,809][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:30:56,104][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:30:56,429][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:30:56,751][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:30:57,075][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:30:57,398][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:30:57,725][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:30:58,052][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:30:58,377][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:30:58,706][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:30:59,035][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:30:59,360][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:30:59,687][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:31:00,014][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:31:00,339][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:31:00,665][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:31:00,989][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:31:01,313][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:31:01,637][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:31:01,961][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:31:02,286][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:31:02,611][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:31:02,936][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:31:03,260][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:31:03,584][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:31:03,907][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:31:04,231][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:31:04,555][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:31:04,879][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:31:05,205][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:31:05,530][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:31:05,855][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:31:06,178][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:31:06,903][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:31:07,643][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:31:07,645][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:31:07,647][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:31:08,692][__main__][INFO] - Iteration 395 took 23s (40.13% Gen, 55.45% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 15m 51s. Estimated total time: 19h 43m 11s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 26s, 500 more iterations: 3h 17m 11s. [2025-11-13 10:31:08,694][__main__][INFO] - Starting iteration 395. [2025-11-13 10:31:08,697][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 39 and human policies 1. 
[2025-11-13 10:31:08,698][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:31:18,336][__main__][INFO] - Number of regex retries in iteration 395: 0 [2025-11-13 10:31:18,337][__main__][INFO] - agents played in iteration 395 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 10:31:18,801][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:31:18,835][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:31:18,868][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:31:18,902][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:31:18,902][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:31:18,903][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:31:19,623][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:31:19,919][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:31:20,242][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:31:20,564][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:31:20,891][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:31:21,215][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:31:21,539][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:31:21,863][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:31:22,187][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:31:22,513][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:31:22,837][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:31:23,163][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:31:23,487][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:31:23,812][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:31:24,136][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:31:24,460][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:31:24,784][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:31:25,109][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:31:25,434][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:31:25,760][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:31:26,084][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:31:26,408][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:31:26,734][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:31:27,058][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:31:27,383][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:31:27,708][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:31:28,033][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:31:28,357][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:31:28,679][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:31:29,005][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:31:29,330][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:31:29,653][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:31:29,979][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:31:30,686][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:31:31,421][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:31:31,422][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:31:31,424][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:31:32,403][__main__][INFO] - Iteration 396 took 23s (40.66% Gen, 55.21% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 17m 36s. Estimated total time: 19h 45m 19s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 30s, 500 more iterations: 3h 17m 33s. [2025-11-13 10:31:32,405][__main__][INFO] - Starting iteration 396. [2025-11-13 10:31:32,409][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 39 and human policies 1. 
[2025-11-13 10:31:32,409][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:31:42,308][__main__][INFO] - Number of regex retries in iteration 396: 0 [2025-11-13 10:31:42,309][__main__][INFO] - agents played in iteration 396 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 10:31:42,772][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:31:42,806][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:31:42,839][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:31:42,873][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:31:42,874][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:31:42,875][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:31:43,623][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:31:43,917][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:31:44,247][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:31:44,576][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:31:44,905][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:31:45,236][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:31:45,562][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:31:45,889][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:31:46,216][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:31:46,545][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:31:46,872][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:31:47,196][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:31:47,520][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:31:47,844][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:31:48,168][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:31:48,493][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:31:48,819][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:31:49,143][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:31:49,467][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:31:49,792][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:31:50,117][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:31:50,441][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:31:50,766][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:31:51,091][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:31:51,416][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:31:51,740][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:31:52,066][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:31:52,390][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:31:52,716][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:31:53,043][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:31:53,368][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:31:53,693][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:31:54,020][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:31:54,754][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:31:55,499][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:31:55,500][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:31:55,502][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:31:56,530][__main__][INFO] - Iteration 397 took 24s (41.04% Gen, 54.70% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 37m 57s. Estimated total time: 20h 6m 4s. Time estimates for 10 more iterations: 4m 1s, 100 more iterations: 40m 12s, 500 more iterations: 3h 21m 0s. [2025-11-13 10:31:56,532][__main__][INFO] - Starting iteration 397. [2025-11-13 10:31:56,535][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 39 and human policies 1. 
[2025-11-13 10:31:56,535][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:32:05,785][__main__][INFO] - Number of regex retries in iteration 397: 0 [2025-11-13 10:32:05,785][__main__][INFO] - agents played in iteration 397 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 10:32:06,245][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:32:06,278][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:32:06,312][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:32:06,345][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:32:06,346][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:32:06,347][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:32:07,057][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:32:07,351][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:32:07,675][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:32:07,999][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:32:08,322][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:32:08,647][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:32:08,973][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:32:09,297][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:32:09,620][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:32:09,949][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:32:10,276][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:32:10,600][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:32:10,930][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:32:11,253][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:32:11,577][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:32:11,901][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:32:12,227][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:32:12,551][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:32:12,875][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:32:13,201][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:32:13,526][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:32:13,850][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:32:14,175][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:32:14,499][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:32:14,824][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:32:15,148][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:32:15,475][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:32:15,800][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:32:16,125][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:32:16,451][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:32:16,774][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:32:17,098][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:32:17,425][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:32:18,134][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:32:18,869][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:32:18,870][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:32:18,872][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:32:19,840][__main__][INFO] - Iteration 398 took 23s (39.69% Gen, 56.15% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 56m 45s. Estimated total time: 19h 25m 16s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 50s, 500 more iterations: 3h 14m 12s. [2025-11-13 10:32:19,842][__main__][INFO] - Starting iteration 398. [2025-11-13 10:32:19,845][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 39 and human policies 1. 
[2025-11-13 10:32:19,846][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:32:29,340][__main__][INFO] - Number of regex retries in iteration 398: 0 [2025-11-13 10:32:29,340][__main__][INFO] - agents played in iteration 398 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 10:32:29,792][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:32:30,159][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:32:30,192][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:32:30,226][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:32:30,226][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:32:30,226][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:32:30,956][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:32:31,251][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:32:31,575][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:32:31,898][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:32:32,224][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:32:32,547][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:32:32,872][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:32:33,196][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:32:33,525][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:32:33,854][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:32:34,184][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:32:34,510][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:32:34,834][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:32:35,159][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:32:35,483][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:32:35,809][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:32:36,132][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:32:36,456][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:32:36,781][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:32:37,107][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:32:37,430][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:32:37,754][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:32:38,079][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:32:38,403][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:32:38,726][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:32:39,050][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:32:39,375][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:32:39,700][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:32:40,024][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:32:40,349][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:32:40,676][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:32:41,001][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:32:41,331][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:32:42,043][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:32:42,784][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:32:42,785][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:32:42,787][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:32:43,884][__main__][INFO] - Iteration 399 took 24s (39.49% Gen, 55.94% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 33m 3s. Estimated total time: 20h 1m 58s. Time estimates for 10 more iterations: 4m 0s, 100 more iterations: 40m 3s, 500 more iterations: 3h 20m 19s. [2025-11-13 10:32:43,887][__main__][INFO] - Starting iteration 399. [2025-11-13 10:32:43,890][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 39 and human policies 1. 
[2025-11-13 10:32:43,890][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:32:53,301][__main__][INFO] - Number of regex retries in iteration 399: 0 [2025-11-13 10:32:53,301][__main__][INFO] - agents played in iteration 399 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 10:32:53,774][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:32:53,810][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:32:53,843][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:32:53,877][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:32:53,877][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:32:53,877][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:32:54,574][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:32:54,867][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:32:55,191][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:32:55,522][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:32:55,851][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:32:56,183][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:32:56,513][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:32:56,839][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:32:57,167][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:32:57,497][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:32:57,825][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:32:58,154][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:32:58,479][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:32:58,802][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:32:59,127][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:32:59,452][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:32:59,775][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:33:00,099][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:33:00,424][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:33:00,747][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:33:01,070][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:33:01,395][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:33:01,722][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:33:02,047][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:33:02,371][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:33:02,695][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:33:03,019][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:33:03,344][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:33:03,668][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:33:03,994][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:33:04,319][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:33:04,644][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:33:04,969][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:33:05,697][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:33:06,450][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:33:06,451][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:33:06,453][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:33:07,456][__main__][INFO] - Iteration 400 took 23s (39.93% Gen, 55.81% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 9m 1s. Estimated total time: 19h 38m 20s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 16s, 500 more iterations: 3h 16m 23s. [2025-11-13 10:33:07,458][__main__][INFO] - Starting iteration 400. [2025-11-13 10:33:07,462][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 39 and human policies 1. 
[2025-11-13 10:33:07,462][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:33:16,833][__main__][INFO] - Number of regex retries in iteration 400: 0 [2025-11-13 10:33:16,833][__main__][INFO] - agents played in iteration 400 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 10:33:17,313][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:33:17,350][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:33:17,385][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:33:17,418][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:33:17,419][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:33:17,420][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:33:18,155][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:33:18,449][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:33:18,772][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:33:19,096][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:33:19,422][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:33:19,746][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:33:20,071][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:33:20,394][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:33:20,719][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:33:21,047][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:33:21,371][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:33:21,701][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:33:22,026][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:33:22,350][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:33:22,675][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:33:23,000][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:33:23,326][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:33:23,649][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:33:23,975][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:33:24,300][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:33:24,624][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:33:24,948][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:33:25,272][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:33:25,597][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:33:25,923][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:33:26,248][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:33:26,573][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:33:26,897][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:33:27,223][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:33:27,547][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:33:27,872][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:33:28,200][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:33:28,526][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:33:29,258][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:33:30,000][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:33:30,001][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:33:30,003][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:33:31,905][__main__][INFO] - Iteration 401 took 24s (38.33% Gen, 53.88% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 52m 32s. Estimated total time: 20h 22m 15s. Time estimates for 10 more iterations: 4m 4s, 100 more iterations: 40m 44s, 500 more iterations: 3h 23m 42s. [2025-11-13 10:33:31,907][__main__][INFO] - Starting iteration 401. [2025-11-13 10:33:31,910][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 40 and human policies 1. 
[2025-11-13 10:33:31,910][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:33:41,553][__main__][INFO] - Number of regex retries in iteration 401: 0
[2025-11-13 10:33:41,554][__main__][INFO] - agents played in iteration 401 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:33:42,004][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:33:42,386][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:33:42,419][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:33:42,453][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:33:42,454][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:33:42,454][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:33:43,189][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:33:43,484][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:33:43,813][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:33:44,139][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:33:44,466][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:33:44,789][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:33:45,114][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:33:45,440][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:33:45,767][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:33:46,091][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:33:46,415][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:33:46,741][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:33:47,066][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:33:47,391][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:33:47,715][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:33:48,038][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:33:48,364][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:33:48,689][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:33:49,015][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:33:49,342][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:33:49,666][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:33:49,990][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:33:50,314][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:33:50,640][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:33:50,965][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:33:51,290][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:33:51,616][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:33:51,945][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:33:52,270][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:33:52,595][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:33:52,919][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:33:53,242][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:33:53,568][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:33:54,302][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:33:55,042][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:33:55,043][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:33:55,046][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:33:55,936][__main__][INFO] - Iteration 402 took 24s (40.14% Gen, 56.15% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 31m 13s. Estimated total time: 20h 1m 20s. Time estimates for 10 more iterations: 4m 0s, 100 more iterations: 40m 2s, 500 more iterations: 3h 20m 13s.
[2025-11-13 10:33:55,938][__main__][INFO] - Starting iteration 402.
[2025-11-13 10:33:55,941][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 40 and human policies 1.
[2025-11-13 10:33:55,941][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:34:04,803][__main__][INFO] - Number of regex retries in iteration 402: 0
[2025-11-13 10:34:04,803][__main__][INFO] - agents played in iteration 402 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:34:05,265][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:34:05,640][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:34:05,672][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:34:05,705][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:34:05,706][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:34:05,706][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:34:06,399][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:34:06,695][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:34:07,025][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:34:07,355][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:34:07,685][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:34:08,011][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:34:08,338][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:34:08,665][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:34:08,990][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:34:09,320][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:34:09,643][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:34:09,968][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:34:10,293][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:34:10,617][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:34:10,941][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:34:11,266][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:34:11,592][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:34:11,916][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:34:12,240][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:34:12,565][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:34:12,889][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:34:13,213][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:34:13,537][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:34:13,863][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:34:14,189][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:34:14,514][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:34:14,837][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:34:15,162][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:34:15,488][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:34:15,815][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:34:16,139][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:34:16,465][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:34:16,788][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:34:17,511][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:34:18,244][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:34:18,246][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:34:18,247][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:34:19,290][__main__][INFO] - Iteration 403 took 23s (37.95% Gen, 57.58% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 56m 59s. Estimated total time: 19h 27m 29s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 54s, 500 more iterations: 3h 14m 34s.
[2025-11-13 10:34:19,304][__main__][INFO] - Starting iteration 403.
[2025-11-13 10:34:19,308][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 40 and human policies 1.
[2025-11-13 10:34:19,309][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:34:28,181][__main__][INFO] - Number of regex retries in iteration 403: 0
[2025-11-13 10:34:28,182][__main__][INFO] - agents played in iteration 403 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:34:28,643][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:34:28,678][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:34:28,711][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:34:28,744][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:34:28,745][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:34:28,746][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:34:29,458][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:34:29,751][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:34:30,077][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:34:30,401][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:34:30,726][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:34:31,055][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:34:31,381][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:34:31,708][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:34:32,032][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:34:32,362][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:34:32,687][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:34:33,012][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:34:33,336][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:34:33,661][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:34:33,986][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:34:34,311][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:34:34,636][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:34:34,961][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:34:35,287][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:34:35,611][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:34:35,935][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:34:36,259][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:34:36,586][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:34:36,910][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:34:37,235][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:34:37,560][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:34:37,886][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:34:38,211][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:34:38,535][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:34:38,859][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:34:39,185][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:34:39,511][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:34:39,837][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:34:40,565][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:34:41,289][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:34:41,290][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:34:41,292][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:34:42,339][__main__][INFO] - Iteration 404 took 23s (38.52% Gen, 56.92% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 40m 46s. Estimated total time: 19h 11m 39s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 23s, 500 more iterations: 3h 11m 56s.
[2025-11-13 10:34:42,341][__main__][INFO] - Starting iteration 404.
[2025-11-13 10:34:42,344][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 40 and human policies 1.
[2025-11-13 10:34:42,344][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:34:51,661][__main__][INFO] - Number of regex retries in iteration 404: 0
[2025-11-13 10:34:51,662][__main__][INFO] - agents played in iteration 404 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:34:52,126][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:34:52,159][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:34:52,192][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:34:52,226][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:34:52,227][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:34:52,227][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:34:52,917][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:34:53,211][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:34:53,536][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:34:53,862][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:34:54,185][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:34:54,509][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:34:54,839][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:34:55,162][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:34:55,489][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:34:55,813][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:34:56,140][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:34:56,465][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:34:56,791][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:34:57,116][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:34:57,442][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:34:57,768][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:34:58,093][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:34:58,417][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:34:58,741][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:34:59,065][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:34:59,389][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:34:59,714][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:35:00,038][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:35:00,364][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:35:00,687][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:35:01,011][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:35:01,335][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:35:01,664][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:35:01,988][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:35:02,312][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:35:02,643][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:35:02,970][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:35:03,295][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:35:04,022][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:35:04,746][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:35:04,747][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:35:04,749][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:35:06,004][__main__][INFO] - Iteration 405 took 23s (39.38% Gen, 55.31% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 11m 48s. Estimated total time: 19h 43m 5s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 26s, 500 more iterations: 3h 17m 10s.
[2025-11-13 10:35:06,006][__main__][INFO] - Starting iteration 405.
[2025-11-13 10:35:06,010][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 40 and human policies 1.
[2025-11-13 10:35:06,010][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:35:15,060][__main__][INFO] - Number of regex retries in iteration 405: 0
[2025-11-13 10:35:15,061][__main__][INFO] - agents played in iteration 405 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:35:15,522][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:35:15,555][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:35:15,587][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:35:15,622][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:35:15,622][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:35:15,622][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:35:16,308][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:35:16,601][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:35:16,926][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:35:17,249][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:35:17,574][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:35:17,898][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:35:18,223][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:35:18,548][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:35:18,871][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:35:19,198][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:35:19,523][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:35:19,849][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:35:20,176][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:35:20,501][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:35:20,826][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:35:21,150][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:35:21,475][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:35:21,801][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:35:22,126][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:35:22,451][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:35:22,775][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:35:23,100][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:35:23,424][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:35:23,749][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:35:24,074][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:35:24,398][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:35:24,725][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:35:25,050][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:35:25,374][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:35:25,697][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:35:26,020][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:35:26,346][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:35:26,672][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:35:27,393][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:35:28,112][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:35:28,113][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:35:28,115][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:35:29,157][__main__][INFO] - Iteration 406 took 23s (39.09% Gen, 56.39% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 45m 45s. Estimated total time: 19h 17m 25s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 34s, 500 more iterations: 3h 12m 54s.
[2025-11-13 10:35:29,159][__main__][INFO] - Starting iteration 406.
[2025-11-13 10:35:29,162][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 40 and human policies 1.
[2025-11-13 10:35:29,163][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:35:37,817][__main__][INFO] - Number of regex retries in iteration 406: 0
[2025-11-13 10:35:37,817][__main__][INFO] - agents played in iteration 406 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:35:38,282][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:35:38,315][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:35:38,347][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:35:38,380][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:35:38,381][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:35:38,381][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:35:39,119][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:35:39,413][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:35:39,737][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:35:40,061][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:35:40,387][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:35:40,709][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:35:41,034][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:35:41,359][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:35:41,685][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:35:42,011][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:35:42,336][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:35:42,661][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:35:42,986][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:35:43,312][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:35:43,642][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:35:43,966][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:35:44,292][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:35:44,615][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:35:44,939][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:35:45,266][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:35:45,589][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:35:45,912][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:35:46,237][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:35:46,561][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:35:46,888][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:35:47,211][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:35:47,536][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:35:47,859][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:35:48,184][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:35:48,509][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:35:48,834][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:35:49,158][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:35:49,482][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:35:50,205][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:35:50,932][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:35:50,978][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:35:50,980][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:35:51,969][__main__][INFO] - Iteration 407 took 22s (37.94% Gen, 57.71% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 28m 21s. Estimated total time: 19h 0m 24s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 0s, 500 more iterations: 3h 10m 4s. [2025-11-13 10:35:51,971][__main__][INFO] - Starting iteration 407. [2025-11-13 10:35:51,975][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 40 and human policies 1. 
[2025-11-13 10:35:51,975][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:36:00,995][__main__][INFO] - Number of regex retries in iteration 407: 0 [2025-11-13 10:36:00,996][__main__][INFO] - agents played in iteration 407 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 10:36:01,457][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:36:01,493][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:36:01,527][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:36:01,561][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:36:01,561][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:36:01,562][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:36:02,282][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:36:02,576][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:36:02,901][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:36:03,225][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:36:03,549][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:36:03,873][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:36:04,198][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:36:04,524][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:36:04,848][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:36:05,172][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:36:05,498][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:36:05,826][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:36:06,152][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:36:06,478][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:36:06,807][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:36:07,132][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:36:07,455][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:36:07,780][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:36:08,106][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:36:08,431][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:36:08,755][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:36:09,081][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:36:09,406][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:36:09,730][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:36:10,055][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:36:10,380][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:36:10,707][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:36:11,032][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:36:11,356][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:36:11,681][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:36:12,006][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:36:12,330][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:36:12,656][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:36:13,376][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:36:14,121][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:36:14,123][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:36:14,125][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:36:15,183][__main__][INFO] - Iteration 408 took 23s (38.87% Gen, 56.57% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 48m 2s. Estimated total time: 19h 20m 28s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 40s, 500 more iterations: 3h 13m 24s. [2025-11-13 10:36:15,185][__main__][INFO] - Starting iteration 408. [2025-11-13 10:36:15,188][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 40 and human policies 1. 
[2025-11-13 10:36:15,189][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:36:24,150][__main__][INFO] - Number of regex retries in iteration 408: 0 [2025-11-13 10:36:24,150][__main__][INFO] - agents played in iteration 408 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 10:36:24,604][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:36:24,640][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:36:24,673][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:36:24,707][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:36:24,708][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:36:24,708][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:36:25,426][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:36:25,722][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:36:26,047][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:36:26,374][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:36:26,696][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:36:27,022][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:36:27,347][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:36:27,672][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:36:28,000][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:36:28,326][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:36:28,652][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:36:28,976][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:36:29,303][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:36:29,633][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:36:29,962][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:36:30,286][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:36:30,610][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:36:30,936][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:36:31,262][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:36:31,588][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:36:31,913][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:36:32,238][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:36:32,562][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:36:32,888][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:36:33,213][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:36:33,537][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:36:33,863][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:36:34,188][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:36:34,515][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:36:34,838][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:36:35,163][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:36:35,488][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:36:35,817][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:36:36,523][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:36:37,271][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:36:37,273][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:36:37,275][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:36:38,318][__main__][INFO] - Iteration 409 took 23s (38.74% Gen, 56.74% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 43m 43s. Estimated total time: 19h 16m 32s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 33s, 500 more iterations: 3h 12m 45s. [2025-11-13 10:36:38,320][__main__][INFO] - Starting iteration 409. [2025-11-13 10:36:38,323][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 40 and human policies 1. 
[2025-11-13 10:36:38,324][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:36:47,585][__main__][INFO] - Number of regex retries in iteration 409: 0 [2025-11-13 10:36:47,586][__main__][INFO] - agents played in iteration 409 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 10:36:48,051][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:36:48,084][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:36:48,118][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:36:48,151][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:36:48,151][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:36:48,152][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:36:48,881][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:36:49,178][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:36:49,508][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:36:49,835][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:36:50,159][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:36:50,488][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:36:50,817][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:36:51,143][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:36:51,469][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:36:51,793][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:36:52,118][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:36:52,445][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:36:52,770][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:36:53,094][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:36:53,422][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:36:53,750][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:36:54,075][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:36:54,400][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:36:54,725][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:36:55,049][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:36:55,375][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:36:55,699][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:36:56,024][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:36:56,348][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:36:56,672][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:36:56,997][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:36:57,322][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:36:57,648][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:36:57,973][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:36:58,299][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:36:58,624][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:36:58,948][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:36:59,273][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:37:00,027][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:37:00,783][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:37:00,784][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:37:00,786][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:37:01,785][__main__][INFO] - Iteration 410 took 23s (39.48% Gen, 56.26% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 59m 55s. Estimated total time: 19h 33m 7s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 6s, 500 more iterations: 3h 15m 31s. [2025-11-13 10:37:01,787][__main__][INFO] - Starting iteration 410. [2025-11-13 10:37:01,791][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 40 and human policies 1. 
[2025-11-13 10:37:01,791][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:37:11,257][__main__][INFO] - Number of regex retries in iteration 410: 0 [2025-11-13 10:37:11,257][__main__][INFO] - agents played in iteration 410 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 10:37:11,725][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:37:11,765][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:37:11,800][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:37:11,834][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:37:11,835][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:37:11,835][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:37:12,882][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:37:13,176][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:37:13,500][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:37:13,825][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:37:14,148][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:37:14,474][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:37:14,800][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:37:15,126][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:37:15,451][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:37:15,775][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:37:16,101][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:37:16,432][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:37:16,757][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:37:17,080][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:37:17,405][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:37:17,731][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:37:18,056][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:37:18,380][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:37:18,703][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:37:19,029][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:37:19,353][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:37:19,677][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:37:20,001][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:37:20,326][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:37:20,650][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:37:20,976][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:37:21,300][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:37:21,626][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:37:21,949][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:37:22,274][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:37:22,600][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:37:22,930][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:37:23,260][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:37:23,992][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:37:24,737][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:37:24,739][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:37:24,740][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:37:26,501][__main__][INFO] - Iteration 411 took 24s (38.31% Gen, 54.56% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 1m 57s. Estimated total time: 20h 35m 35s. Time estimates for 10 more iterations: 4m 7s, 100 more iterations: 41m 11s, 500 more iterations: 3h 25m 55s. [2025-11-13 10:37:26,504][__main__][INFO] - Starting iteration 411. [2025-11-13 10:37:26,507][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 41 and human policies 1. 
[2025-11-13 10:37:26,507][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:37:36,595][__main__][INFO] - Number of regex retries in iteration 411: 0 [2025-11-13 10:37:36,596][__main__][INFO] - agents played in iteration 411 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 10:37:37,045][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:37:37,078][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:37:37,110][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:37:37,143][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:37:37,143][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:37:37,144][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:37:37,834][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:37:38,129][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:37:38,453][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:37:38,779][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:37:39,104][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:37:39,426][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:37:39,754][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:37:40,077][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:37:40,401][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:37:40,726][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:37:41,053][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:37:41,382][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:37:41,712][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:37:42,036][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:37:42,360][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:37:42,684][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:37:43,008][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:37:43,332][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:37:43,657][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:37:43,983][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:37:44,309][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:37:44,634][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:37:44,960][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:37:45,284][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:37:45,609][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:37:45,934][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:37:46,258][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:37:46,583][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:37:46,907][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:37:47,232][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:37:47,557][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:37:47,882][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:37:48,208][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:37:48,920][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:37:49,647][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:37:49,648][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:37:49,650][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:37:51,003][__main__][INFO] - Iteration 412 took 24s (41.18% Gen, 53.29% Train). Generation: 10s, Training: 13s. Estimated remaining time: 17h 50m 49s. Estimated total time: 20h 24m 51s. Time estimates for 10 more iterations: 4m 4s, 100 more iterations: 40m 49s, 500 more iterations: 3h 24m 8s. [2025-11-13 10:37:51,007][__main__][INFO] - Starting iteration 412. [2025-11-13 10:37:51,009][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 41 and human policies 1. 
[2025-11-13 10:37:51,010][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:38:00,041][__main__][INFO] - Number of regex retries in iteration 412: 0
[2025-11-13 10:38:00,042][__main__][INFO] - agents played in iteration 412 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:38:00,506][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:38:00,539][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:38:00,572][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:38:00,605][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:38:00,606][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:38:00,606][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:38:01,342][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[... 30 similar "Processing mini-batch N of 128" entries (N = 4, 8, …, 120) elided, one roughly every 0.33 s ...]
[2025-11-13 10:38:11,389][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:38:11,713][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:38:12,423][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:38:13,146][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:38:13,147][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:38:13,149][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:38:14,214][__main__][INFO] - Iteration 413 took 23s (38.92% Gen, 56.49% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 45m 51s. Estimated total time: 19h 20m 16s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 40s, 500 more iterations: 3h 13m 22s.
[2025-11-13 10:38:14,215][__main__][INFO] - Starting iteration 413.
[2025-11-13 10:38:14,219][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 41 and human policies 1.
[2025-11-13 10:38:14,219][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:38:23,277][__main__][INFO] - Number of regex retries in iteration 413: 0
[2025-11-13 10:38:23,278][__main__][INFO] - agents played in iteration 413 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:38:23,774][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:38:23,810][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:38:23,843][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:38:23,876][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:38:23,876][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:38:23,877][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:38:24,607][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[... 30 similar "Processing mini-batch N of 128" entries (N = 4, 8, …, 120) elided, one roughly every 0.33 s ...]
[2025-11-13 10:38:34,657][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:38:34,982][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:38:35,724][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:38:36,476][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:38:36,478][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:38:36,479][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:38:37,451][__main__][INFO] - Iteration 414 took 23s (38.99% Gen, 56.82% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 46m 50s. Estimated total time: 19h 21m 38s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 43s, 500 more iterations: 3h 13m 36s.
[2025-11-13 10:38:37,453][__main__][INFO] - Starting iteration 414.
[2025-11-13 10:38:37,456][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 41 and human policies 1.
[2025-11-13 10:38:37,457][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:38:47,007][__main__][INFO] - Number of regex retries in iteration 414: 0
[2025-11-13 10:38:47,008][__main__][INFO] - agents played in iteration 414 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:38:47,471][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:38:47,504][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:38:47,537][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:38:47,570][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:38:47,570][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:38:47,571][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:38:48,303][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[... 30 similar "Processing mini-batch N of 128" entries (N = 4, 8, …, 120) elided, one roughly every 0.33 s ...]
[2025-11-13 10:38:58,325][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:38:58,648][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:38:59,374][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:39:00,111][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:39:00,113][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:39:00,114][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:39:01,254][__main__][INFO] - Iteration 415 took 23s (40.13% Gen, 55.07% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 14m 45s. Estimated total time: 19h 49m 57s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 39s, 500 more iterations: 3h 18m 19s.
[2025-11-13 10:39:01,256][__main__][INFO] - Starting iteration 415.
[2025-11-13 10:39:01,259][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 41 and human policies 1.
[2025-11-13 10:39:01,259][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:39:11,164][__main__][INFO] - Number of regex retries in iteration 415: 0
[2025-11-13 10:39:11,165][__main__][INFO] - agents played in iteration 415 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:39:11,630][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:39:11,663][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:39:11,697][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:39:11,730][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:39:11,731][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:39:11,731][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:39:12,440][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[... 30 similar "Processing mini-batch N of 128" entries (N = 4, 8, …, 120) elided, one roughly every 0.33 s ...]
[2025-11-13 10:39:22,491][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:39:22,814][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:39:23,529][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:39:24,258][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:39:24,260][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:39:24,261][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:39:25,195][__main__][INFO] - Iteration 416 took 23s (41.38% Gen, 54.71% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 21m 16s. Estimated total time: 19h 56m 52s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 53s, 500 more iterations: 3h 19m 28s.
[2025-11-13 10:39:25,198][__main__][INFO] - Starting iteration 416.
[2025-11-13 10:39:25,200][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 41 and human policies 1.
[2025-11-13 10:39:25,201][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:39:34,059][__main__][INFO] - Number of regex retries in iteration 416: 0
[2025-11-13 10:39:34,060][__main__][INFO] - agents played in iteration 416 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:39:34,522][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:39:34,556][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:39:34,589][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:39:34,623][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:39:34,624][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:39:34,625][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:39:35,325][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[... 30 similar "Processing mini-batch N of 128" entries (N = 4, 8, …, 120) elided, one roughly every 0.33 s ...]
[2025-11-13 10:39:45,362][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:39:45,688][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:39:46,415][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:39:47,140][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:39:47,142][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:39:47,143][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:39:48,178][__main__][INFO] - Iteration 417 took 22s (38.55% Gen, 56.94% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 32m 57s. Estimated total time: 19h 8m 56s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 17s, 500 more iterations: 3h 11m 29s.
[2025-11-13 10:39:48,180][__main__][INFO] - Starting iteration 417.
[2025-11-13 10:39:48,183][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 41 and human policies 1.
[2025-11-13 10:39:48,184][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:39:57,292][__main__][INFO] - Number of regex retries in iteration 417: 0
[2025-11-13 10:39:57,293][__main__][INFO] - agents played in iteration 417 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:39:57,758][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:39:57,796][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:39:57,829][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:39:57,863][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:39:57,863][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:39:57,864][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:39:58,602][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[... 30 similar "Processing mini-batch N of 128" entries (N = 4, 8, …, 120) elided, one roughly every 0.33 s ...]
[2025-11-13 10:40:08,635][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:40:08,958][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:40:09,672][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:40:10,403][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:40:10,405][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:40:10,406][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:40:11,512][__main__][INFO] - Iteration 418 took 23s (39.04% Gen, 56.21% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 50m 8s. Estimated total time: 19h 26m 30s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 53s, 500 more iterations: 3h 14m 25s. [2025-11-13 10:40:11,514][__main__][INFO] - Starting iteration 418. [2025-11-13 10:40:11,518][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 41 and human policies 1. 
[2025-11-13 10:40:11,518][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:40:20,494][__main__][INFO] - Number of regex retries in iteration 418: 0 [2025-11-13 10:40:20,495][__main__][INFO] - agents played in iteration 418 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 10:40:20,943][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:40:20,976][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:40:21,008][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:40:21,041][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:40:21,041][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:40:21,042][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:40:21,757][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:40:22,051][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:40:22,376][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:40:22,702][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:40:23,025][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:40:23,348][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:40:23,672][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:40:23,999][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:40:24,322][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:40:24,644][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:40:24,970][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:40:25,298][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:40:25,622][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:40:25,950][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:40:26,280][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:40:26,610][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:40:26,936][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:40:27,265][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:40:27,591][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:40:27,915][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:40:28,240][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:40:28,564][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:40:28,889][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:40:29,215][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:40:29,538][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:40:29,862][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:40:30,186][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:40:30,511][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:40:30,835][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:40:31,159][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:40:31,483][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:40:31,809][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:40:32,132][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:40:32,858][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:40:33,599][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:40:33,600][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:40:33,602][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:40:34,860][__main__][INFO] - Iteration 419 took 23s (38.45% Gen, 56.15% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 50m 25s. Estimated total time: 19h 27m 10s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 54s, 500 more iterations: 3h 14m 31s. [2025-11-13 10:40:34,862][__main__][INFO] - Starting iteration 419. [2025-11-13 10:40:34,865][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 41 and human policies 1. 
[2025-11-13 10:40:34,866][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:40:44,083][__main__][INFO] - Number of regex retries in iteration 419: 0 [2025-11-13 10:40:44,083][__main__][INFO] - agents played in iteration 419 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 10:40:44,541][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:40:44,579][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:40:44,613][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:40:44,647][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:40:44,647][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:40:44,648][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:40:45,370][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:40:45,663][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:40:45,988][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:40:46,312][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:40:46,636][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:40:46,960][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:40:47,284][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:40:47,607][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:40:47,930][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:40:48,253][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:40:48,576][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:40:48,899][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:40:49,224][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:40:49,548][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:40:49,872][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:40:50,195][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:40:50,526][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:40:50,850][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:40:51,174][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:40:51,501][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:40:51,827][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:40:52,151][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:40:52,475][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:40:52,799][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:40:53,125][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:40:53,448][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:40:53,772][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:40:54,096][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:40:54,421][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:40:54,746][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:40:55,070][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:40:55,394][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:40:55,721][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:40:56,462][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:40:57,210][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:40:57,211][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:40:57,213][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:40:58,645][__main__][INFO] - Iteration 420 took 23s (38.76% Gen, 55.21% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 11m 51s. Estimated total time: 19h 49m 0s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 38s, 500 more iterations: 3h 18m 10s. [2025-11-13 10:40:58,647][__main__][INFO] - Starting iteration 420. [2025-11-13 10:40:58,650][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 41 and human policies 1. 
[2025-11-13 10:40:58,651][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:41:08,371][__main__][INFO] - Number of regex retries in iteration 420: 0 [2025-11-13 10:41:08,371][__main__][INFO] - agents played in iteration 420 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 10:41:08,833][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:41:08,868][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:41:08,902][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:41:08,936][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:41:08,937][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:41:08,937][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:41:09,631][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:41:09,926][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:41:10,251][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:41:10,574][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:41:10,899][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:41:11,225][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:41:11,550][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:41:11,876][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:41:12,200][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:41:12,525][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:41:12,849][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:41:13,175][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:41:13,505][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:41:13,829][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:41:14,154][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:41:14,484][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:41:14,812][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:41:15,142][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:41:15,468][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:41:15,792][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:41:16,117][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:41:16,440][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:41:16,764][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:41:17,088][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:41:17,412][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:41:17,736][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:41:18,060][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:41:18,383][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:41:18,706][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:41:19,030][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:41:19,353][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:41:19,678][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:41:20,004][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:41:20,739][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:41:21,495][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:41:21,497][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:41:21,499][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:41:23,277][__main__][INFO] - Iteration 421 took 24s (39.47% Gen, 53.30% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 53m 47s. Estimated total time: 20h 31m 21s. Time estimates for 10 more iterations: 4m 6s, 100 more iterations: 41m 2s, 500 more iterations: 3h 25m 13s. [2025-11-13 10:41:23,279][__main__][INFO] - Starting iteration 421. [2025-11-13 10:41:23,281][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 42 and human policies 1. 
[2025-11-13 10:41:23,282][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:41:32,290][__main__][INFO] - Number of regex retries in iteration 421: 0 [2025-11-13 10:41:32,290][__main__][INFO] - agents played in iteration 421 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 10:41:32,754][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:41:32,789][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:41:32,822][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:41:32,856][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:41:32,857][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:41:32,857][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:41:33,542][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:41:33,836][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:41:34,162][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:41:34,485][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:41:34,810][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:41:35,133][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:41:35,457][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:41:35,782][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:41:36,108][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:41:36,434][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:41:36,762][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:41:37,087][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:41:37,414][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:41:37,741][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:41:38,071][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:41:38,401][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:41:38,728][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:41:39,057][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:41:39,380][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:41:39,705][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:41:40,028][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:41:40,352][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:41:40,676][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:41:41,002][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:41:41,327][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:41:41,651][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:41:41,976][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:41:42,302][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:41:42,626][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:41:42,950][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:41:43,275][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:41:43,599][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:41:43,924][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:41:44,656][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:41:45,388][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:41:45,391][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:41:45,393][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:41:46,616][__main__][INFO] - Iteration 422 took 23s (38.60% Gen, 56.15% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 48m 49s. Estimated total time: 19h 26m 46s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 53s, 500 more iterations: 3h 14m 27s. [2025-11-13 10:41:46,618][__main__][INFO] - Starting iteration 422. [2025-11-13 10:41:46,622][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 42 and human policies 1. 
[2025-11-13 10:41:46,622][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:41:56,463][__main__][INFO] - Number of regex retries in iteration 422: 0 [2025-11-13 10:41:56,463][__main__][INFO] - agents played in iteration 422 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 10:41:56,930][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:41:56,962][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:41:56,995][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:41:57,028][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:41:57,029][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:41:57,030][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:41:57,758][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:41:58,055][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:41:58,378][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:41:58,703][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:41:59,027][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:41:59,352][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:41:59,676][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:41:59,998][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:42:00,325][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:42:00,649][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:42:00,974][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:42:01,297][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:42:01,622][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:42:01,946][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:42:02,271][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:42:02,596][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:42:02,923][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:42:03,247][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:42:03,571][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:42:03,895][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:42:04,218][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:42:04,543][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:42:04,868][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:42:05,192][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:42:05,516][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:42:05,841][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:42:06,164][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:42:06,488][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:42:06,812][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:42:07,135][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:42:07,460][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:42:07,784][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:42:08,107][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:42:08,828][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:42:09,559][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:42:09,561][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:42:09,562][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:42:10,522][__main__][INFO] - Iteration 423 took 23s (41.17% Gen, 54.81% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 16m 40s. Estimated total time: 19h 55m 2s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 50s, 500 more iterations: 3h 19m 10s. [2025-11-13 10:42:10,524][__main__][INFO] - Starting iteration 423. [2025-11-13 10:42:10,527][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 42 and human policies 1. 
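# NOTE (editor): each iteration above follows the same cycle — compute advantages, accumulate the policy-gradient loss over 128 mini-batches, apply a single "reinforce" step, then checkpoint optimizer/trainer state. The following is a minimal, hypothetical sketch of the accumulation step only; none of these names come from the actual mllm codebase, and the toy 3-token mini-batches stand in for the real 30-token ones (128 x 30 = 3840 tokens per step in the log).

```python
def accumulate_policy_gradient(minibatches, grad_fn):
    """Sum per-mini-batch gradients before a single parameter update,
    mirroring the 'Processing mini-batch k of 128' pattern in the log.
    `grad_fn` returns (gradient, token_count) for one mini-batch."""
    total_grad = 0.0
    total_tokens = 0
    for i, batch in enumerate(minibatches):
        if i % 4 == 0:  # the log reports progress every 4th mini-batch
            print(f"Processing mini-batch {i} of {len(minibatches)}")
        g, n_tokens = grad_fn(batch)
        total_grad += g        # accumulate instead of stepping per batch
        total_tokens += n_tokens
    # one optimizer step would be applied here ("Apply reinforce step")
    return total_grad / len(minibatches), total_tokens


# Toy usage: gradient of the loss 0.5 * (w - x)^2 w.r.t. w, at w = 0,
# averaged within each mini-batch of three "tokens".
w = 0.0
grad_fn = lambda batch: (sum(w - x for x in batch) / len(batch), len(batch))
minibatches = [[1.0, 2.0, 3.0]] * 128
avg_grad, tokens = accumulate_policy_gradient(minibatches, grad_fn)
# avg_grad == -2.0, tokens == 384 (128 mini-batches x 3 tokens each)
```

A single update over the accumulated gradient keeps peak VRAM bounded by one mini-batch's activations, which is consistent with the small, flat "Block Peak % of device VRAM" the log reports during accumulation.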
[2025-11-13 10:42:10,528][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:42:20,138][__main__][INFO] - Number of regex retries in iteration 423: 0
[2025-11-13 10:42:20,138][__main__][INFO] - agents played in iteration 423 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:42:20,600][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:42:20,632][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:42:20,665][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:42:20,698][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:42:20,698][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:42:20,699][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:42:21,422][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:42:21,718][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:42:22,044][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:42:22,370][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:42:22,695][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:42:23,018][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:42:23,345][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:42:23,667][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:42:23,994][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:42:24,318][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:42:24,644][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:42:24,969][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:42:25,294][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:42:25,619][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:42:25,948][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:42:26,272][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:42:26,597][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:42:26,921][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:42:27,247][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:42:27,571][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:42:27,895][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:42:28,220][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:42:28,544][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:42:28,868][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:42:29,192][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:42:29,518][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:42:29,843][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:42:30,167][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:42:30,492][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:42:30,817][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:42:31,142][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:42:31,466][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:42:31,789][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:42:32,500][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:42:33,241][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:42:33,242][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:42:33,244][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:42:34,418][__main__][INFO] - Iteration 424 took 23s (40.22% Gen, 54.86% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 15m 50s. Estimated total time: 19h 54m 36s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 49s, 500 more iterations: 3h 19m 6s.
[2025-11-13 10:42:34,420][__main__][INFO] - Starting iteration 424.
[2025-11-13 10:42:34,423][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 42 and human policies 1.
[2025-11-13 10:42:34,424][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:42:44,217][__main__][INFO] - Number of regex retries in iteration 424: 0
[2025-11-13 10:42:44,218][__main__][INFO] - agents played in iteration 424 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:42:44,692][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:42:44,726][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:42:44,758][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:42:44,791][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:42:44,792][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:42:44,792][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:42:45,477][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:42:45,771][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:42:46,095][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:42:46,419][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:42:46,744][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:42:47,069][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:42:47,395][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:42:47,721][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:42:48,046][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:42:48,370][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:42:48,695][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:42:49,026][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:42:49,352][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:42:49,679][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:42:50,005][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:42:50,329][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:42:50,654][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:42:50,979][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:42:51,304][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:42:51,630][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:42:51,953][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:42:52,277][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:42:52,602][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:42:52,927][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:42:53,253][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:42:53,576][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:42:53,902][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:42:54,227][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:42:54,553][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:42:54,877][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:42:55,201][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:42:55,527][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:42:55,854][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:42:56,575][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:42:57,300][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:42:57,302][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:42:57,304][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:42:58,257][__main__][INFO] - Iteration 425 took 23s (41.09% Gen, 54.90% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 12m 34s. Estimated total time: 19h 51m 43s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 43s, 500 more iterations: 3h 18m 37s.
[2025-11-13 10:42:58,259][__main__][INFO] - Starting iteration 425.
[2025-11-13 10:42:58,262][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 42 and human policies 1.
[2025-11-13 10:42:58,262][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:43:07,758][__main__][INFO] - Number of regex retries in iteration 425: 0
[2025-11-13 10:43:07,758][__main__][INFO] - agents played in iteration 425 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:43:08,223][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:43:08,256][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:43:08,289][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:43:08,322][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:43:08,323][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:43:08,323][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:43:09,011][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:43:09,306][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:43:09,630][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:43:09,953][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:43:10,277][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:43:10,600][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:43:10,925][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:43:11,249][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:43:11,575][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:43:11,897][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:43:12,224][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:43:12,552][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:43:12,878][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:43:13,204][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:43:13,530][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:43:13,854][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:43:14,179][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:43:14,503][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:43:14,829][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:43:15,155][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:43:15,478][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:43:15,804][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:43:16,128][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:43:16,455][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:43:16,780][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:43:17,106][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:43:17,430][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:43:17,756][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:43:18,081][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:43:18,407][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:43:18,733][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:43:19,058][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:43:19,386][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:43:20,137][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:43:20,886][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:43:20,888][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:43:20,889][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:43:21,885][__main__][INFO] - Iteration 426 took 23s (40.19% Gen, 55.58% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 1m 39s. Estimated total time: 19h 41m 12s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 22s, 500 more iterations: 3h 16m 52s.
[2025-11-13 10:43:21,887][__main__][INFO] - Starting iteration 426.
[2025-11-13 10:43:21,889][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 42 and human policies 1.
[2025-11-13 10:43:21,890][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:43:31,526][__main__][INFO] - Number of regex retries in iteration 426: 0
[2025-11-13 10:43:31,526][__main__][INFO] - agents played in iteration 426 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:43:31,981][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:43:32,014][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:43:32,047][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:43:32,079][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:43:32,080][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:43:32,080][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:43:32,787][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:43:33,080][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:43:33,406][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:43:33,728][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:43:34,053][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:43:34,379][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:43:34,704][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:43:35,030][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:43:35,355][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:43:35,681][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:43:36,007][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:43:36,333][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:43:36,659][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:43:36,986][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:43:37,312][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:43:37,636][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:43:37,961][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:43:38,286][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:43:38,614][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:43:38,940][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:43:39,263][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:43:39,589][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:43:39,915][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:43:40,240][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:43:40,567][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:43:40,893][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:43:41,217][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:43:41,543][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:43:41,868][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:43:42,193][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:43:42,518][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:43:42,843][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:43:43,167][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:43:43,896][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:43:44,628][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:43:44,629][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:43:44,631][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:43:45,659][__main__][INFO] - Iteration 427 took 23s (40.54% Gen, 55.13% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 8m 36s. Estimated total time: 19h 48m 33s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 37s, 500 more iterations: 3h 18m 5s.
[2025-11-13 10:43:45,662][__main__][INFO] - Starting iteration 427.
[2025-11-13 10:43:45,664][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 42 and human policies 1.
[2025-11-13 10:43:45,665][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:43:54,937][__main__][INFO] - Number of regex retries in iteration 427: 0
[2025-11-13 10:43:54,937][__main__][INFO] - agents played in iteration 427 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:43:55,396][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:43:55,429][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:43:55,463][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:43:55,496][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:43:55,497][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:43:55,497][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:43:56,194][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:43:56,488][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:43:56,813][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:43:57,138][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:43:57,465][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:43:57,794][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:43:58,120][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:43:58,447][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:43:58,771][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:43:59,097][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:43:59,423][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:43:59,747][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:44:00,072][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:44:00,396][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:44:00,720][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:44:01,045][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:44:01,369][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:44:01,694][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:44:02,019][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:44:02,343][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:44:02,669][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:44:02,995][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:44:03,319][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:44:03,644][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:44:03,969][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:44:04,292][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:44:04,616][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:44:04,942][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:44:05,268][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:44:05,592][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:44:05,915][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:44:06,240][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:44:06,567][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:44:07,286][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:44:08,024][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:44:08,026][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:44:08,027][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:44:09,082][__main__][INFO] - Iteration 428 took 23s (39.59% Gen, 55.90% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 50m 36s. Estimated total time: 19h 30m 56s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 1s, 500 more iterations: 3h 15m 9s.
[2025-11-13 10:44:09,084][__main__][INFO] - Starting iteration 428.
[2025-11-13 10:44:09,087][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 42 and human policies 1.
[2025-11-13 10:44:09,087][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:44:18,094][__main__][INFO] - Number of regex retries in iteration 428: 0
[2025-11-13 10:44:18,094][__main__][INFO] - agents played in iteration 428 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:44:18,552][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:44:18,586][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:44:18,619][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:44:18,653][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:44:18,653][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:44:18,654][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:44:19,363][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:44:19,657][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:44:19,981][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:44:20,304][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:44:20,629][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:44:20,952][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:44:21,275][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:44:21,600][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:44:21,925][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:44:22,251][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:44:22,575][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:44:22,901][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:44:23,228][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:44:23,553][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:44:23,879][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:44:24,205][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:44:24,528][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:44:24,853][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:44:25,178][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:44:25,503][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:44:25,828][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:44:26,153][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:44:26,477][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:44:26,803][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:44:27,128][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:44:27,452][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:44:27,778][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:44:28,102][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:44:28,428][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:44:28,754][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:44:29,080][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:44:29,406][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:44:29,732][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:44:30,460][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:44:31,200][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:44:31,201][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:44:31,203][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:44:32,210][__main__][INFO] - Iteration 429 took 23s (38.95% Gen, 56.69% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 35m 30s. Estimated total time: 19h 16m 13s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 32s, 500 more iterations: 3h 12m 42s. [2025-11-13 10:44:32,213][__main__][INFO] - Starting iteration 429. [2025-11-13 10:44:32,215][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 42 and human policies 1. 
[2025-11-13 10:44:32,216][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:44:41,535][__main__][INFO] - Number of regex retries in iteration 429: 0 [2025-11-13 10:44:41,536][__main__][INFO] - agents played in iteration 429 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 10:44:42,000][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:44:42,033][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:44:42,065][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:44:42,098][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:44:42,099][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:44:42,099][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:44:42,793][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:44:43,086][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:44:43,411][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:44:43,735][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:44:44,059][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:44:44,384][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:44:44,709][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:44:45,031][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:44:45,355][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:44:45,681][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:44:46,006][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:44:46,330][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:44:46,655][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:44:46,980][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:44:47,309][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:44:47,634][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:44:47,959][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:44:48,284][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:44:48,607][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:44:48,933][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:44:49,259][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:44:49,583][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:44:49,908][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:44:50,233][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:44:50,559][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:44:50,882][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:44:51,208][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:44:51,533][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:44:51,858][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:44:52,182][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:44:52,508][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:44:52,831][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:44:53,158][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:44:53,895][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:44:54,629][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:44:54,631][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:44:54,632][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:44:55,564][__main__][INFO] - Iteration 430 took 23s (39.91% Gen, 56.09% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 46m 22s. Estimated total time: 19h 27m 28s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 54s, 500 more iterations: 3h 14m 34s. [2025-11-13 10:44:55,566][__main__][INFO] - Starting iteration 430. [2025-11-13 10:44:55,569][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 42 and human policies 1. 
[2025-11-13 10:44:55,569][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:45:04,677][__main__][INFO] - Number of regex retries in iteration 430: 0 [2025-11-13 10:45:04,678][__main__][INFO] - agents played in iteration 430 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 10:45:05,142][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:45:05,175][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:45:05,207][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:45:05,240][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:45:05,241][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:45:05,241][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:45:05,968][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:45:06,261][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:45:06,588][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:45:06,910][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:45:07,235][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:45:07,558][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:45:07,887][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:45:08,211][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:45:08,536][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:45:08,863][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:45:09,190][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:45:09,517][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:45:09,849][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:45:10,174][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:45:10,503][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:45:10,830][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:45:11,158][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:45:11,484][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:45:11,810][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:45:12,135][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:45:12,461][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:45:12,787][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:45:13,113][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:45:13,438][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:45:13,764][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:45:14,088][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:45:14,414][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:45:14,738][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:45:15,063][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:45:15,387][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:45:15,710][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:45:16,035][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:45:16,360][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:45:17,084][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:45:17,778][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:45:17,780][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:45:17,781][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:45:19,665][__main__][INFO] - Iteration 431 took 24s (37.80% Gen, 54.38% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 23m 20s. Estimated total time: 20h 4m 51s. Time estimates for 10 more iterations: 4m 0s, 100 more iterations: 40m 9s, 500 more iterations: 3h 20m 48s. [2025-11-13 10:45:19,667][__main__][INFO] - Starting iteration 431. [2025-11-13 10:45:19,670][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 43 and human policies 1. 
[2025-11-13 10:45:19,671][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:45:28,721][__main__][INFO] - Number of regex retries in iteration 431: 0 [2025-11-13 10:45:28,721][__main__][INFO] - agents played in iteration 431 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 10:45:29,178][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:45:29,555][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:45:29,589][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:45:29,622][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:45:29,623][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:45:29,623][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:45:30,318][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:45:30,612][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:45:30,937][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:45:31,259][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:45:31,585][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:45:31,908][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:45:32,230][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:45:32,556][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:45:32,883][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:45:33,208][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:45:33,534][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:45:33,857][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:45:34,183][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:45:34,507][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:45:34,832][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:45:35,157][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:45:35,482][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:45:35,808][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:45:36,134][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:45:36,465][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:45:36,789][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:45:37,115][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:45:37,439][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:45:37,764][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:45:38,089][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:45:38,413][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:45:38,737][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:45:39,061][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:45:39,387][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:45:39,713][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:45:40,038][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:45:40,364][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:45:40,689][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:45:41,430][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:45:42,133][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:45:42,134][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:45:42,136][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:45:43,104][__main__][INFO] - Iteration 432 took 23s (38.62% Gen, 57.24% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 49m 50s. Estimated total time: 19h 31m 44s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 3s, 500 more iterations: 3h 15m 17s. [2025-11-13 10:45:43,106][__main__][INFO] - Starting iteration 432. [2025-11-13 10:45:43,109][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 43 and human policies 1. 
[2025-11-13 10:45:43,110][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:45:52,488][__main__][INFO] - Number of regex retries in iteration 432: 0 [2025-11-13 10:45:52,489][__main__][INFO] - agents played in iteration 432 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 10:45:52,938][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:45:52,971][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:45:53,004][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:45:53,038][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:45:53,039][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:45:53,039][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:45:53,752][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:45:54,047][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:45:54,372][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:45:54,696][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:45:55,019][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:45:55,350][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:45:55,676][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:45:56,003][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:45:56,328][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:45:56,658][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:45:56,982][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:45:57,312][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:45:57,643][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:45:57,968][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:45:58,293][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:45:58,619][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:45:58,944][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:45:59,269][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:45:59,594][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:45:59,919][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:46:00,245][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:46:00,569][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:46:00,894][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:46:01,220][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:46:01,546][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:46:01,871][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:46:02,196][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:46:02,519][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:46:02,845][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:46:03,170][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:46:03,494][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:46:03,819][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:46:04,144][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:46:04,866][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:46:05,550][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:46:05,553][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:46:05,555][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:46:06,554][__main__][INFO] - Iteration 433 took 23s (40.00% Gen, 55.73% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 50m 1s. Estimated total time: 19h 32m 19s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 4s, 500 more iterations: 3h 15m 23s. [2025-11-13 10:46:06,557][__main__][INFO] - Starting iteration 433. [2025-11-13 10:46:06,560][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 43 and human policies 1. 
[2025-11-13 10:46:06,560][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:46:15,646][__main__][INFO] - Number of regex retries in iteration 433: 0 [2025-11-13 10:46:15,647][__main__][INFO] - agents played in iteration 433 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 10:46:16,115][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:46:16,484][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:46:16,517][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:46:16,551][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:46:16,552][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:46:16,552][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:46:17,280][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:46:17,574][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:46:17,898][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:46:18,221][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:46:18,546][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:46:18,869][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:46:19,193][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:46:19,517][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:46:19,840][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:46:20,162][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:46:20,485][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:46:20,811][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:46:21,137][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:46:21,465][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:46:21,791][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:46:22,117][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:46:22,443][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:46:22,768][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:46:23,092][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:46:23,418][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 10:46:23,744][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 10:46:24,068][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:46:24,393][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:46:24,718][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:46:25,044][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:46:25,369][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:46:25,692][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:46:26,019][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:46:26,345][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:46:26,670][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:46:26,994][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:46:27,319][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:46:27,646][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:46:28,373][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:46:29,076][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:46:29,078][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:46:29,080][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:46:30,101][__main__][INFO] - Iteration 434 took 23s (38.60% Gen, 57.06% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 54m 24s. Estimated total time: 19h 37m 5s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 14s, 500 more iterations: 3h 16m 10s. [2025-11-13 10:46:30,103][__main__][INFO] - Starting iteration 434. [2025-11-13 10:46:30,106][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 43 and human policies 1. 
[2025-11-13 10:46:30,107][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:46:40,034][__main__][INFO] - Number of regex retries in iteration 434: 0
[2025-11-13 10:46:40,035][__main__][INFO] - agents played in iteration 434 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:46:40,498][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:46:40,532][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:46:40,565][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:46:40,598][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:46:40,599][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:46:40,599][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:46:41,333][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:46:41,628][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:46:41,951][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:46:42,274][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:46:42,596][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:46:42,918][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:46:43,242][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:46:43,566][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:46:43,890][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:46:44,213][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:46:44,538][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:46:44,863][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:46:45,189][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:46:45,512][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:46:45,836][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:46:46,163][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:46:46,488][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:46:46,812][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:46:47,139][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:46:47,463][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:46:47,789][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:46:48,114][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:46:48,439][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:46:48,763][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:46:49,088][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:46:49,413][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:46:49,738][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:46:50,063][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:46:50,388][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:46:50,713][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:46:51,037][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:46:51,361][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:46:51,686][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:46:52,425][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:46:53,163][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:46:53,165][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:46:53,166][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:46:54,160][__main__][INFO] - Iteration 435 took 24s (41.27% Gen, 54.59% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 19m 39s. Estimated total time: 20h 2m 44s. Time estimates for 10 more iterations: 4m 0s, 100 more iterations: 40m 5s, 500 more iterations: 3h 20m 27s.
[2025-11-13 10:46:54,162][__main__][INFO] - Starting iteration 435.
[2025-11-13 10:46:54,165][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 43 and human policies 1.
[2025-11-13 10:46:54,166][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:47:03,296][__main__][INFO] - Number of regex retries in iteration 435: 0
[2025-11-13 10:47:03,296][__main__][INFO] - agents played in iteration 435 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:47:03,760][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:47:03,794][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:47:03,827][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:47:03,860][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:47:03,860][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:47:03,861][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:47:04,592][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:47:04,887][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:47:05,210][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:47:05,532][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:47:05,857][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:47:06,181][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:47:06,506][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:47:06,829][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:47:07,154][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:47:07,477][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:47:07,802][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:47:08,129][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:47:08,454][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:47:08,780][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:47:09,108][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:47:09,434][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:47:09,758][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:47:10,088][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:47:10,415][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:47:10,741][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:47:11,067][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:47:11,391][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:47:11,715][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:47:12,039][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:47:12,366][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:47:12,692][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:47:13,018][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:47:13,341][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:47:13,666][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:47:13,991][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:47:14,318][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:47:14,642][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:47:14,969][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:47:15,706][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:47:16,460][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:47:16,462][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:47:16,464][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:47:17,480][__main__][INFO] - Iteration 436 took 23s (39.16% Gen, 56.47% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 42m 19s. Estimated total time: 19h 25m 48s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 51s, 500 more iterations: 3h 14m 18s.
[2025-11-13 10:47:17,482][__main__][INFO] - Starting iteration 436.
[2025-11-13 10:47:17,486][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 43 and human policies 1.
[2025-11-13 10:47:17,487][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:47:26,746][__main__][INFO] - Number of regex retries in iteration 436: 0
[2025-11-13 10:47:26,747][__main__][INFO] - agents played in iteration 436 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:47:27,212][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:47:27,247][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:47:27,279][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:47:27,312][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:47:27,313][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:47:27,313][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:47:28,049][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:47:28,342][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:47:28,668][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:47:28,991][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:47:29,316][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:47:29,639][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:47:29,964][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:47:30,289][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:47:30,614][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:47:30,939][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:47:31,265][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:47:31,589][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:47:31,915][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:47:32,241][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:47:32,565][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:47:32,890][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:47:33,213][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:47:33,540][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:47:33,864][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:47:34,190][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:47:34,514][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:47:34,839][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:47:35,166][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:47:35,492][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:47:35,817][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:47:36,143][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:47:36,468][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:47:36,795][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:47:37,120][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:47:37,444][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:47:37,769][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:47:38,095][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:47:38,421][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:47:39,179][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:47:39,924][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:47:39,925][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:47:39,927][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:47:41,016][__main__][INFO] - Iteration 437 took 23s (39.35% Gen, 56.01% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 52m 40s. Estimated total time: 19h 36m 32s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 13s, 500 more iterations: 3h 16m 5s.
[2025-11-13 10:47:41,018][__main__][INFO] - Starting iteration 437.
[2025-11-13 10:47:41,021][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 43 and human policies 1.
[2025-11-13 10:47:41,022][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:47:50,289][__main__][INFO] - Number of regex retries in iteration 437: 0
[2025-11-13 10:47:50,290][__main__][INFO] - agents played in iteration 437 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:47:50,757][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:47:50,790][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:47:50,822][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:47:50,855][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:47:50,855][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:47:50,856][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:47:51,593][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:47:51,886][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:47:52,211][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:47:52,537][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:47:52,865][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:47:53,187][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:47:53,512][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:47:53,842][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:47:54,167][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:47:54,492][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:47:54,819][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:47:55,145][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:47:55,472][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:47:55,797][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:47:56,126][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:47:56,452][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:47:56,780][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:47:57,111][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:47:57,437][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:47:57,762][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:47:58,088][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:47:58,413][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:47:58,738][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:47:59,062][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:47:59,388][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:47:59,713][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:48:00,039][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:48:00,364][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:48:00,690][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:48:01,013][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:48:01,338][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:48:01,665][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:48:01,989][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:48:02,731][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:48:03,464][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:48:03,465][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:48:03,467][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:48:04,481][__main__][INFO] - Iteration 438 took 23s (39.51% Gen, 56.17% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 48m 45s. Estimated total time: 19h 33m 0s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 6s, 500 more iterations: 3h 15m 30s.
[2025-11-13 10:48:04,483][__main__][INFO] - Starting iteration 438.
[2025-11-13 10:48:04,486][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 43 and human policies 1.
[2025-11-13 10:48:04,487][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:48:14,110][__main__][INFO] - Number of regex retries in iteration 438: 0
[2025-11-13 10:48:14,111][__main__][INFO] - agents played in iteration 438 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:48:14,583][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:48:14,618][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:48:14,651][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:48:14,685][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:48:14,685][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:48:14,686][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:48:15,398][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:48:15,691][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:48:16,016][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:48:16,341][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:48:16,664][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:48:16,989][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:48:17,317][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:48:17,640][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:48:17,964][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:48:18,290][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:48:18,616][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:48:18,942][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:48:19,266][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:48:19,592][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:48:19,918][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:48:20,244][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:48:20,569][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:48:20,895][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:48:21,221][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:48:21,547][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:48:21,872][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:48:22,197][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:48:22,523][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:48:22,848][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:48:23,172][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:48:23,497][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:48:23,823][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:48:24,148][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:48:24,472][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:48:24,797][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:48:25,121][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:48:25,446][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:48:25,773][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:48:26,510][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:48:27,262][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:48:27,263][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:48:27,265][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:48:28,229][__main__][INFO] - Iteration 439 took 23s (40.53% Gen, 55.40% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 2m 31s. Estimated total time: 19h 47m 10s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 34s, 500 more iterations: 3h 17m 51s.
[2025-11-13 10:48:28,231][__main__][INFO] - Starting iteration 439.
[2025-11-13 10:48:28,234][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 43 and human policies 1.
[2025-11-13 10:48:28,234][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:48:37,012][__main__][INFO] - Number of regex retries in iteration 439: 0
[2025-11-13 10:48:37,013][__main__][INFO] - agents played in iteration 439 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:48:37,461][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:48:37,494][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:48:37,528][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:48:37,562][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:48:37,563][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:48:37,563][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:48:38,292][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:48:38,586][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:48:38,911][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:48:39,237][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:48:39,560][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:48:39,885][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:48:40,210][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:48:40,534][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:48:40,857][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:48:41,182][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:48:41,506][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:48:41,831][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:48:42,159][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:48:42,483][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:48:42,809][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:48:43,134][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:48:43,462][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:48:43,788][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:48:44,117][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:48:44,442][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:48:44,766][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:48:45,092][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:48:45,418][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:48:45,742][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:48:46,068][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:48:46,393][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:48:46,719][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:48:47,044][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:48:47,370][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:48:47,693][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:48:48,018][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:48:48,342][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:48:48,667][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
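The mini-batch entries above show gradients being accumulated across all 128 mini-batches (with a progress message every 4th one) before the single "Apply reinforce step" that follows, covering 3840 tokens, i.e. 30 tokens per mini-batch on average. A minimal sketch of that accumulation pattern, with the batch structure and per-token loss stand-in being illustrative assumptions rather than the actual trainer_common code:

```python
def accumulate_policy_loss(mini_batches, log_every=4):
    """Accumulate a per-token policy loss across mini-batches.

    `mini_batches` is a list of lists of per-token loss values; summing each
    batch stands in for the backward pass that accumulates its gradients.
    Returns the accumulated loss and the total token count, after which a
    single optimizer step would be applied.
    """
    total_loss, total_tokens = 0.0, 0
    for i, batch in enumerate(mini_batches):
        if i % log_every == 0:  # the log reports every 4th mini-batch
            print(f"Processing mini-batch {i} of {len(mini_batches)}")
        total_loss += sum(batch)      # stand-in for loss.backward()
        total_tokens += len(batch)
    print(f"Accumulated the policy gradient loss for {total_tokens} tokens.")
    return total_loss, total_tokens
```

With 128 mini-batches of 30 tokens each, this reproduces the 3840-token total seen in the log.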
[2025-11-13 10:48:49,393][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:48:50,121][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:48:50,122][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:48:50,124][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:48:51,097][__main__][INFO] - Iteration 440 took 22s (38.39% Gen, 57.35% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 18m 10s. Estimated total time: 19h 3m 12s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 6s, 500 more iterations: 3h 10m 32s.
[2025-11-13 10:48:51,099][__main__][INFO] - Starting iteration 440.
[2025-11-13 10:48:51,102][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 43 and human policies 1.
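The per-iteration summary above projects remaining and total run time from the wall time observed so far. The script's actual smoothing scheme is not visible in the log; a plausible minimal version simply averages the per-iteration time and scales by the iterations left (the argument values in the test are made-up examples, not the run's real totals):

```python
def eta_seconds(elapsed_seconds, done_iterations, total_iterations):
    """Project remaining run time from the average per-iteration wall time."""
    avg = elapsed_seconds / done_iterations
    return avg * (total_iterations - done_iterations)
```

The "10 / 100 / 500 more iterations" figures in the log follow the same pattern: the average per-iteration time multiplied by 10, 100, and 500.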
[2025-11-13 10:48:51,103][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:49:00,258][__main__][INFO] - Number of regex retries in iteration 440: 0
[2025-11-13 10:49:00,259][__main__][INFO] - agents played in iteration 440 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:49:00,720][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:49:00,753][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:49:00,785][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:49:00,818][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:49:00,819][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:49:00,819][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:49:01,552][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:49:01,846][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:49:02,171][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:49:02,494][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:49:02,819][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:49:03,142][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:49:03,465][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:49:03,789][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:49:04,114][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:49:04,438][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:49:04,763][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:49:05,089][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:49:05,414][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:49:05,741][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:49:06,066][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:49:06,393][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:49:06,718][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:49:07,046][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:49:07,371][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:49:07,695][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:49:08,019][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:49:08,343][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:49:08,670][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:49:08,994][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:49:09,320][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:49:09,646][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:49:09,972][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:49:10,298][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:49:10,623][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:49:10,948][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:49:11,273][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:49:11,599][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:49:11,925][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:49:12,640][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:49:13,380][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:49:13,381][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:49:13,383][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:49:15,398][__main__][INFO] - Iteration 441 took 24s (37.68% Gen, 54.02% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 29m 24s. Estimated total time: 20h 14m 50s. Time estimates for 10 more iterations: 4m 2s, 100 more iterations: 40m 29s, 500 more iterations: 3h 22m 28s.
[2025-11-13 10:49:15,401][__main__][INFO] - Starting iteration 441.
[2025-11-13 10:49:15,404][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 44 and human policies 1.
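Each iteration ends with the optimizer and trainer state being checkpointed (the "Saved ... state to ..." entries). The run writes torch `.pt` files, presumably via `torch.save(optimizer.state_dict(), path)`, plus a pickled trainer/annealing state. A stdlib-only sketch of the same save/restore round trip, with paths and state contents as illustrative assumptions:

```python
import pickle
from pathlib import Path


def save_state(state: dict, path: Path) -> None:
    """Persist a state dict, creating parent directories as the run's
    agent_trainer/ output directory layout would require."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(state, f)


def load_state(path: Path) -> dict:
    """Restore a previously saved state dict (used on resume)."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

Saving every iteration keeps the run resumable at the cost of roughly a second per iteration, consistent with the sub-second gaps between the three "Saved ..." timestamps above.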
[2025-11-13 10:49:15,404][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:49:25,489][__main__][INFO] - Number of regex retries in iteration 441: 0
[2025-11-13 10:49:25,490][__main__][INFO] - agents played in iteration 441 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:49:25,958][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:49:25,991][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:49:26,023][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:49:26,056][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:49:26,057][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:49:26,057][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:49:26,765][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:49:27,058][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:49:27,383][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:49:27,709][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:49:28,034][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:49:28,358][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:49:28,682][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:49:29,008][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:49:29,334][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:49:29,658][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:49:29,983][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:49:30,309][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:49:30,635][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:49:30,959][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:49:31,287][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:49:31,614][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:49:31,938][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:49:32,263][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:49:32,588][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:49:32,912][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:49:33,237][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:49:33,561][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:49:33,887][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:49:34,214][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:49:34,539][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:49:34,864][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:49:35,189][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:49:35,514][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:49:35,838][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:49:36,163][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:49:36,487][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:49:36,813][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:49:37,138][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:49:37,877][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:49:38,613][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:49:38,653][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:49:38,679][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:49:39,793][__main__][INFO] - Iteration 442 took 24s (41.35% Gen, 54.08% Train). Generation: 10s, Training: 13s. Estimated remaining time: 17h 33m 39s. Estimated total time: 20h 19m 30s. Time estimates for 10 more iterations: 4m 3s, 100 more iterations: 40m 39s, 500 more iterations: 3h 23m 15s.
[2025-11-13 10:49:39,796][__main__][INFO] - Starting iteration 442.
[2025-11-13 10:49:39,799][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 44 and human policies 1.
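The recurring VRAM lines report three figures per task block: the change in allocated memory across the block, the current allocation, and the block's peak, each as a percentage of device capacity. The exact instrumentation is not shown in the log; with torch it would plausibly compare `torch.cuda.memory_allocated()` before and after the block and read `torch.cuda.max_memory_allocated()` for the peak. A dependency-free sketch of just the percentage formatting, with plain byte counts standing in for the CUDA queries:

```python
def vram_report(before_bytes, after_bytes, peak_bytes, device_total_bytes):
    """Format a VRAM summary line in the style seen in the log."""
    delta_pct = 100.0 * (after_bytes - before_bytes) / device_total_bytes
    current_pct = 100.0 * after_bytes / device_total_bytes
    peak_pct = 100.0 * peak_bytes / device_total_bytes
    return (f"ΔVRAM % (total): {delta_pct:.2f}%, "
            f"Current % of VRAM taken: {current_pct:.2f}%, "
            f"Block Peak % of device VRAM: {peak_pct:.2f}%")
```

A 0.00% delta with a nonzero block peak, as in the advantage-computation entries, indicates the block's temporary buffers were freed before the post-block measurement.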
[2025-11-13 10:49:39,799][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:49:49,442][__main__][INFO] - Number of regex retries in iteration 442: 0
[2025-11-13 10:49:49,442][__main__][INFO] - agents played in iteration 442 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:49:49,909][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:49:49,943][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:49:49,976][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:50:50,009][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:49:50,010][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:49:50,010][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:49:50,769][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:49:51,065][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:49:51,390][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:49:51,720][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:49:52,046][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:49:52,374][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:49:52,700][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:49:53,026][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:49:53,352][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:49:53,678][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:49:54,001][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:49:54,326][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:49:54,650][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:49:54,976][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:49:55,301][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:49:55,625][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:49:55,949][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:49:56,275][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:49:56,599][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:49:56,925][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:49:57,249][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:49:57,573][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:49:57,897][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:49:58,220][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:49:58,544][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:49:58,871][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:49:59,196][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:49:59,521][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:49:59,845][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:50:00,169][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:50:00,492][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:50:00,817][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:50:01,142][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:50:01,879][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:50:02,627][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:50:02,636][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:50:02,637][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:50:03,743][__main__][INFO] - Iteration 443 took 23s (40.27% Gen, 55.11% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 11m 0s. Estimated total time: 19h 57m 14s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 54s, 500 more iterations: 3h 19m 32s.
[2025-11-13 10:50:03,746][__main__][INFO] - Starting iteration 443.
[2025-11-13 10:50:03,750][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 44 and human policies 1.
[2025-11-13 10:50:03,750][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:50:12,714][__main__][INFO] - Number of regex retries in iteration 443: 0
[2025-11-13 10:50:12,715][__main__][INFO] - agents played in iteration 443 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:50:13,184][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:50:13,227][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:50:13,261][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:50:13,295][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:50:13,296][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:50:13,296][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:50:14,014][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:50:14,309][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:50:14,636][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:50:14,962][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:50:15,291][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:50:15,617][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:50:15,944][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:50:16,269][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:50:16,594][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:50:16,919][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:50:17,248][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:50:17,576][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:50:17,902][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:50:18,229][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:50:18,554][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:50:18,879][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:50:19,206][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:50:19,531][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:50:19,855][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:50:20,180][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:50:20,505][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:50:20,831][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:50:21,156][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:50:21,481][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:50:21,808][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:50:22,134][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:50:22,460][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:50:22,784][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:50:23,109][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:50:23,433][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:50:23,759][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:50:24,084][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:50:24,410][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:50:25,142][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:50:25,878][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:50:25,880][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:50:25,881][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:50:26,870][__main__][INFO] - Iteration 444 took 23s (38.77% Gen, 56.95% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 29m 25s. Estimated total time: 19h 16m 2s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 32s, 500 more iterations: 3h 12m 40s.
[2025-11-13 10:50:26,872][__main__][INFO] - Starting iteration 444.
[2025-11-13 10:50:26,875][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 44 and human policies 1.
[2025-11-13 10:50:26,876][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:50:36,674][__main__][INFO] - Number of regex retries in iteration 444: 0
[2025-11-13 10:50:36,675][__main__][INFO] - agents played in iteration 444 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:50:37,141][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:50:37,174][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:50:37,207][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:50:37,241][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:50:37,241][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:50:37,241][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:50:37,962][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:50:38,254][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:50:38,578][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:50:38,907][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:50:39,231][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:50:39,555][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:50:39,880][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:50:40,204][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:50:40,531][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:50:40,856][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:50:41,184][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:50:41,512][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:50:41,842][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:50:42,172][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:50:42,497][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:50:42,822][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:50:43,148][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:50:43,471][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:50:43,794][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:50:44,118][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:50:44,441][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:50:44,765][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:50:45,088][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:50:45,413][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:50:45,736][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:50:46,062][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:50:46,387][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:50:46,712][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:50:47,038][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:50:47,361][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:50:47,691][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:50:48,016][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:50:48,342][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:50:49,086][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:50:49,821][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:50:49,826][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:50:49,828][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:50:50,968][__main__][INFO] - Iteration 445 took 24s (40.67% Gen, 54.60% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 17m 39s. Estimated total time: 20h 4m 41s. Time estimates for 10 more iterations: 4m 0s, 100 more iterations: 40m 9s, 500 more iterations: 3h 20m 46s.
[2025-11-13 10:50:50,971][__main__][INFO] - Starting iteration 445.
[2025-11-13 10:50:50,974][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 44 and human policies 1.
[2025-11-13 10:50:50,974][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:50:59,835][__main__][INFO] - Number of regex retries in iteration 445: 0 [2025-11-13 10:50:59,836][__main__][INFO] - agents played in iteration 445 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 10:51:00,299][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:51:00,334][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:51:00,368][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:51:00,401][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:51:00,402][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:51:00,402][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:51:01,143][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:51:01,439][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:51:01,768][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:51:02,098][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:51:02,428][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:51:02,753][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:51:03,085][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:51:03,412][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:51:03,735][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:51:04,061][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:51:04,388][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:51:04,713][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:51:05,040][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:51:05,366][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:51:05,691][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:51:06,016][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:51:06,340][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:51:06,665][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:51:06,988][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:51:07,312][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:51:07,636][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:51:07,961][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:51:08,285][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:51:08,611][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:51:08,935][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:51:09,261][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:51:09,587][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:51:09,912][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:51:10,238][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:51:10,561][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:51:10,886][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:51:11,210][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:51:11,534][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:51:12,247][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:51:12,979][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:51:12,981][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:51:12,982][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:51:14,017][__main__][INFO] - Iteration 446 took 23s (38.46% Gen, 57.05% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 24m 47s. Estimated total time: 19h 12m 12s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 24s, 500 more iterations: 3h 12m 2s.
[2025-11-13 10:51:14,019][__main__][INFO] - Starting iteration 446.
[2025-11-13 10:51:14,023][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 44 and human policies 1.
[2025-11-13 10:51:14,023][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:51:22,980][__main__][INFO] - Number of regex retries in iteration 446: 0
[2025-11-13 10:51:22,981][__main__][INFO] - agents played in iteration 446 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:51:23,432][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:51:23,815][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:51:23,848][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:51:23,882][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:51:23,882][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:51:23,883][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:51:24,623][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:51:24,919][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:51:25,246][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:51:25,571][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:51:25,894][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:51:26,218][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:51:26,544][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:51:26,870][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:51:27,197][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:51:27,520][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:51:27,846][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:51:28,173][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:51:28,497][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:51:28,824][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:51:29,149][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:51:29,477][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:51:29,800][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:51:30,126][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:51:30,449][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:51:30,774][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:51:31,098][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:51:31,422][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:51:31,748][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:51:32,073][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:51:32,397][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:51:32,722][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:51:33,047][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:51:33,372][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:51:33,695][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:51:34,019][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:51:34,342][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:51:34,667][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:51:34,993][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:51:35,843][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:51:36,596][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:51:36,598][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:51:36,599][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:51:37,595][__main__][INFO] - Iteration 447 took 23s (38.00% Gen, 57.77% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 50m 51s. Estimated total time: 19h 38m 39s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 17s, 500 more iterations: 3h 16m 26s.
[2025-11-13 10:51:37,597][__main__][INFO] - Starting iteration 447.
[2025-11-13 10:51:37,601][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 44 and human policies 1.
[2025-11-13 10:51:37,601][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:51:47,084][__main__][INFO] - Number of regex retries in iteration 447: 0
[2025-11-13 10:51:47,085][__main__][INFO] - agents played in iteration 447 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:51:47,553][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:51:47,589][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:51:47,623][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:51:47,656][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:51:47,657][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:51:47,658][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:51:48,372][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:51:48,667][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:51:48,992][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:51:49,321][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:51:49,645][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:51:49,970][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:51:50,295][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:51:50,624][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:51:50,951][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:51:51,273][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:51:51,601][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:51:51,925][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:51:52,251][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:51:52,577][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:51:52,901][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:51:53,227][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:51:53,550][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:51:53,873][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:51:54,197][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:51:54,520][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:51:54,844][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:51:55,170][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:51:55,495][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:51:55,817][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:51:56,145][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:51:56,468][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:51:56,794][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:51:57,118][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:51:57,442][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:51:57,766][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:51:58,090][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:51:58,413][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:51:58,738][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:51:59,463][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:52:00,206][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:52:00,207][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:52:00,209][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:52:01,230][__main__][INFO] - Iteration 448 took 23s (40.13% Gen, 55.54% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 53m 19s. Estimated total time: 19h 41m 31s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 23s, 500 more iterations: 3h 16m 55s.
[2025-11-13 10:52:01,232][__main__][INFO] - Starting iteration 448.
[2025-11-13 10:52:01,235][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 44 and human policies 1.
[2025-11-13 10:52:01,236][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:52:10,195][__main__][INFO] - Number of regex retries in iteration 448: 0
[2025-11-13 10:52:10,196][__main__][INFO] - agents played in iteration 448 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:52:10,662][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:52:10,697][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:52:10,731][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:52:10,765][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:52:10,765][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:52:10,766][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:52:11,492][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:52:11,788][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:52:12,115][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:52:12,442][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:52:12,770][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:52:13,094][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:52:13,422][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:52:13,749][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:52:14,072][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:52:14,396][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:52:14,720][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:52:15,045][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:52:15,370][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:52:15,699][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:52:16,024][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:52:16,349][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:52:16,673][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:52:16,998][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:52:17,322][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:52:17,647][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:52:17,971][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:52:18,296][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:52:18,621][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:52:18,947][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:52:19,270][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:52:19,593][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:52:19,918][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:52:20,243][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:52:20,568][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:52:20,893][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:52:21,219][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:52:21,544][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:52:21,872][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:52:22,592][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:52:23,327][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:52:23,329][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:52:23,330][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:52:24,348][__main__][INFO] - Iteration 449 took 23s (38.77% Gen, 56.83% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 27m 4s. Estimated total time: 19h 15m 39s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 31s, 500 more iterations: 3h 12m 36s.
[2025-11-13 10:52:24,350][__main__][INFO] - Starting iteration 449.
[2025-11-13 10:52:24,354][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 44 and human policies 1.
[2025-11-13 10:52:24,354][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:52:33,908][__main__][INFO] - Number of regex retries in iteration 449: 0
[2025-11-13 10:52:33,908][__main__][INFO] - agents played in iteration 449 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:52:34,378][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:52:34,411][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:52:34,444][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:52:34,477][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:52:34,477][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:52:34,478][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:52:35,205][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:52:35,500][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:52:35,826][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:52:36,151][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:52:36,474][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:52:36,798][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:52:37,122][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:52:37,447][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:52:37,771][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:52:38,094][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:52:38,419][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:52:38,745][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:52:39,069][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:52:39,396][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:52:39,721][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:52:40,047][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:52:40,371][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:52:40,695][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:52:41,021][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:52:41,345][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:52:41,670][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:52:41,995][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:52:42,320][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:52:42,645][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:52:42,969][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:52:43,294][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:52:43,619][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:52:43,944][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:52:44,268][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:52:44,594][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:52:44,919][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:52:45,243][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:52:45,573][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:52:46,315][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:52:47,066][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:52:47,067][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:52:47,069][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:52:48,263][__main__][INFO] - Iteration 450 took 23s (39.96% Gen, 55.04% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 6m 31s. Estimated total time: 19h 55m 30s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 51s, 500 more iterations: 3h 19m 15s.
[2025-11-13 10:52:48,265][__main__][INFO] - Starting iteration 450.
[2025-11-13 10:52:48,268][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 44 and human policies 1.
[2025-11-13 10:52:48,269][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:52:57,289][__main__][INFO] - Number of regex retries in iteration 450: 0
[2025-11-13 10:52:57,290][__main__][INFO] - agents played in iteration 450 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:52:57,757][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:52:57,790][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:52:57,823][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:52:57,857][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:52:57,857][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:52:57,858][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:52:58,597][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:52:58,893][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:52:59,216][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:52:59,539][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:52:59,863][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:53:00,187][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:53:00,514][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:53:00,837][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:53:01,162][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:53:01,487][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:53:01,812][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:53:02,138][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:53:02,462][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:53:02,787][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:53:03,111][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:53:03,436][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:53:03,760][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:53:04,086][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:53:04,411][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:53:04,735][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:53:05,060][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:53:05,386][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:53:05,711][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:53:06,035][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:53:06,360][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:53:06,685][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:53:07,010][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:53:07,336][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:53:07,662][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:53:07,986][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:53:08,310][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:53:08,636][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:53:08,962][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:53:09,694][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:53:10,429][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:53:10,431][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:53:10,433][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:53:12,336][__main__][INFO] - Iteration 451 took 24s (37.48% Gen, 54.61% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 14m 2s. Estimated total time: 20h 3m 25s. Time estimates for 10 more iterations: 4m 0s, 100 more iterations: 40m 6s, 500 more iterations: 3h 20m 34s.
[2025-11-13 10:53:12,338][__main__][INFO] - Starting iteration 451.
[2025-11-13 10:53:12,341][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 45 and human policies 1.
[2025-11-13 10:53:12,341][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:53:21,951][__main__][INFO] - Number of regex retries in iteration 451: 0
[2025-11-13 10:53:21,952][__main__][INFO] - agents played in iteration 451 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:53:22,415][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:53:22,450][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:53:22,483][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:53:22,516][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:53:22,517][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:53:22,517][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:53:23,202][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:53:23,495][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:53:23,819][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:53:24,143][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:53:24,472][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:53:24,797][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:53:25,123][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:53:25,450][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:53:25,776][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:53:26,102][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:53:26,429][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:53:26,756][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:53:27,087][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:53:27,414][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:53:27,738][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:53:28,064][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:53:28,388][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:53:28,714][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:53:29,038][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:53:29,364][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:53:29,688][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:53:30,013][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:53:30,339][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:53:30,665][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:53:30,990][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:53:31,314][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:53:31,639][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:53:31,965][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:53:32,290][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:53:32,616][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:53:32,941][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:53:33,265][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:53:33,590][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:53:34,326][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:53:35,059][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:53:35,060][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:53:35,062][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:53:36,261][__main__][INFO] - Iteration 452 took 23s (40.17% Gen, 54.80% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 6m 16s. Estimated total time: 19h 56m 3s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 52s, 500 more iterations: 3h 19m 20s.
[2025-11-13 10:53:36,263][__main__][INFO] - Starting iteration 452.
[2025-11-13 10:53:36,267][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 45 and human policies 1.
[2025-11-13 10:53:36,267][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:53:45,699][__main__][INFO] - Number of regex retries in iteration 452: 0
[2025-11-13 10:53:45,699][__main__][INFO] - agents played in iteration 452 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:53:46,170][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:53:46,203][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:53:46,236][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:53:46,269][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:53:46,270][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:53:46,270][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:53:47,007][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:53:47,301][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:53:47,627][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:53:47,951][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:53:48,277][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:53:48,601][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:53:48,925][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:53:49,250][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:53:49,575][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:53:49,901][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:53:50,227][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:53:50,552][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:53:50,876][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:53:51,201][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:53:51,525][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:53:51,851][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:53:52,176][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:53:52,501][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:53:52,827][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:53:53,151][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:53:53,475][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:53:53,799][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:53:54,124][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:53:54,449][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:53:54,774][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:53:55,098][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:53:55,422][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:53:55,747][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:53:56,071][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:53:56,396][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:53:56,723][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:53:57,048][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:53:57,379][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:53:58,115][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:53:58,843][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:53:58,845][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:53:58,846][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:53:59,834][__main__][INFO] - Iteration 453 took 23s (40.02% Gen, 55.78% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 48m 12s. Estimated total time: 19h 38m 23s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 16s, 500 more iterations: 3h 16m 23s.
[2025-11-13 10:53:59,836][__main__][INFO] - Starting iteration 453.
[2025-11-13 10:53:59,839][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 45 and human policies 1.
[2025-11-13 10:53:59,839][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:54:08,797][__main__][INFO] - Number of regex retries in iteration 453: 0
[2025-11-13 10:54:08,798][__main__][INFO] - agents played in iteration 453 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:54:09,248][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:54:09,282][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:54:09,315][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:54:09,349][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:54:09,349][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:54:09,350][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:54:10,092][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:54:10,386][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:54:10,712][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:54:11,040][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:54:11,367][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:54:11,697][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:54:12,023][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:54:12,349][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:54:12,672][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:54:12,997][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:54:13,323][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:54:13,654][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:54:13,980][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:54:14,303][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:54:14,628][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:54:14,952][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:54:15,277][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:54:15,602][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:54:15,928][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:54:16,253][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:54:16,578][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:54:16,903][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:54:17,229][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:54:17,555][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:54:17,881][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:54:18,207][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:54:18,531][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:54:18,855][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:54:19,180][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:54:19,505][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:54:19,832][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:54:20,161][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:54:20,492][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:54:21,210][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:54:21,941][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:54:21,942][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:54:21,944][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:54:22,920][__main__][INFO] - Iteration 454 took 23s (38.81% Gen, 56.95% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 23m 32s. Estimated total time: 19h 14m 6s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 28s, 500 more iterations: 3h 12m 21s.
[2025-11-13 10:54:22,922][__main__][INFO] - Starting iteration 454.
[2025-11-13 10:54:22,925][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 45 and human policies 1.
[2025-11-13 10:54:22,926][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:54:31,990][__main__][INFO] - Number of regex retries in iteration 454: 0
[2025-11-13 10:54:31,991][__main__][INFO] - agents played in iteration 454 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:54:32,454][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:54:32,487][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:54:32,520][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:54:32,553][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:54:32,553][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:54:32,554][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:54:33,272][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:54:33,566][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:54:33,894][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:54:34,224][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:54:34,546][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:54:34,870][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:54:35,195][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:54:35,521][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:54:35,847][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:54:36,173][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:54:36,497][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:54:36,821][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:54:37,146][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:54:37,474][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:54:37,799][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:54:38,124][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:54:38,448][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:54:38,773][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:54:39,097][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:54:39,421][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:54:39,745][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:54:40,070][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:54:40,396][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:54:40,721][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:54:41,046][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:54:41,372][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:54:41,696][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:54:42,021][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:54:42,347][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:54:42,673][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:54:42,998][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:54:43,325][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:54:43,651][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:54:44,356][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:54:45,090][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:54:45,092][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:54:45,094][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:54:46,095][__main__][INFO] - Iteration 455 took 23s (39.12% Gen, 56.55% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 27m 35s. Estimated total time: 19h 18m 32s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 37s, 500 more iterations: 3h 13m 5s.
[2025-11-13 10:54:46,097][__main__][INFO] - Starting iteration 455.
[2025-11-13 10:54:46,100][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 45 and human policies 1.
[2025-11-13 10:54:46,101][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:54:55,032][__main__][INFO] - Number of regex retries in iteration 455: 0
[2025-11-13 10:54:55,032][__main__][INFO] - agents played in iteration 455 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:54:55,482][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:54:55,518][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:54:55,551][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:54:55,583][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:54:55,584][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:54:55,584][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:54:56,633][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:54:56,930][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:54:57,255][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:54:57,579][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:54:57,907][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:54:58,230][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:54:58,555][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:54:58,878][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:54:59,205][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:54:59,532][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:54:59,857][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:55:00,186][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:55:00,510][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:55:00,839][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:55:01,164][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:55:01,490][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:55:01,814][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:55:02,139][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:55:02,464][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:55:02,789][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:55:03,113][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:55:03,438][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:55:03,762][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:55:04,087][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:55:04,411][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:55:04,734][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:55:05,058][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:55:05,383][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:55:05,708][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:55:06,031][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:55:06,356][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:55:06,679][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:55:07,003][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:55:07,729][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:55:08,463][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:55:08,465][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:55:08,466][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:55:09,669][__main__][INFO] - Iteration 456 took 23s (37.89% Gen, 57.00% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 47m 9s. Estimated total time: 19h 38m 30s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 17s, 500 more iterations: 3h 16m 25s.
[2025-11-13 10:55:09,673][__main__][INFO] - Starting iteration 456.
[2025-11-13 10:55:09,676][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 45 and human policies 1.
[2025-11-13 10:55:09,677][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:55:18,756][__main__][INFO] - Number of regex retries in iteration 456: 0
[2025-11-13 10:55:18,756][__main__][INFO] - agents played in iteration 456 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:55:19,216][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:55:19,249][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:55:19,282][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:55:19,316][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:55:19,316][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:55:19,317][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:55:20,052][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:55:20,348][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:55:20,675][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:55:21,000][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:55:21,326][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:55:21,652][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:55:21,980][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:55:22,304][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:55:22,630][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:55:22,957][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:55:23,286][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:55:23,612][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:55:23,939][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:55:24,267][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:55:24,593][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:55:24,920][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:55:25,244][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:55:25,571][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:55:25,894][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:55:26,220][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:55:26,543][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:55:26,868][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:55:27,193][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:55:27,518][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:55:27,843][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:55:28,168][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:55:28,492][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:55:28,816][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:55:29,141][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:55:29,465][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:55:29,789][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:55:30,113][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:55:30,438][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:55:31,143][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:55:31,868][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:55:31,870][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:55:31,871][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:55:32,898][__main__][INFO] - Iteration 457 took 23s (39.09% Gen, 56.47% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 29m 25s. Estimated total time: 19h 21m 8s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 42s, 500 more iterations: 3h 13m 31s.
[2025-11-13 10:55:32,900][__main__][INFO] - Starting iteration 457.
[2025-11-13 10:55:32,903][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 45 and human policies 1.
[2025-11-13 10:55:32,904][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:55:42,494][__main__][INFO] - Number of regex retries in iteration 457: 0
[2025-11-13 10:55:42,494][__main__][INFO] - agents played in iteration 457 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:55:42,958][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:55:42,991][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:55:43,023][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:55:43,057][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:55:43,057][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:55:43,058][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:55:43,802][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:55:44,096][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:55:44,424][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:55:44,749][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:55:45,072][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:55:45,402][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:55:45,732][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:55:46,057][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:55:46,383][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:55:46,710][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:55:47,036][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:55:47,366][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:55:47,699][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:55:48,030][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:55:48,356][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:55:48,682][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:55:49,008][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:55:49,333][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:55:49,657][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:55:49,982][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:55:50,307][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:55:50,631][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:55:50,956][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:55:51,280][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:55:51,606][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:55:51,929][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:55:52,255][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:55:52,580][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:55:52,907][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:55:53,232][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:55:53,561][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:55:53,887][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:55:54,211][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:55:54,940][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:55:55,676][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:55:55,677][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:55:55,679][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:55:56,727][__main__][INFO] - Iteration 458 took 23s (40.26% Gen, 55.34% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 59m 6s. Estimated total time: 19h 51m 13s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 42s, 500 more iterations: 3h 18m 32s.
[2025-11-13 10:55:56,729][__main__][INFO] - Starting iteration 458.
[2025-11-13 10:55:56,732][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 45 and human policies 1.
[2025-11-13 10:55:56,733][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:56:05,252][__main__][INFO] - Number of regex retries in iteration 458: 0
[2025-11-13 10:56:05,252][__main__][INFO] - agents played in iteration 458 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:56:05,702][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:56:06,072][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:56:06,106][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:56:06,140][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:56:06,140][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:56:06,141][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:56:06,858][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:56:07,152][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:56:07,477][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:56:07,801][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:56:08,126][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:56:08,450][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:56:08,773][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:56:09,097][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:56:09,425][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:56:09,750][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:56:10,075][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:56:10,400][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:56:10,726][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:56:11,056][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:56:11,387][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:56:11,712][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:56:12,036][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:56:12,360][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:56:12,684][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:56:13,008][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:56:13,334][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:56:13,660][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:56:13,985][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:56:14,309][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:56:14,634][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:56:14,960][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:56:15,285][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:56:15,609][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:56:15,934][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:56:16,258][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:56:16,583][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:56:16,908][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:56:17,233][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:56:17,933][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:56:18,654][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:56:18,655][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:56:18,657][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:56:19,676][__main__][INFO] - Iteration 459 took 22s (37.13% Gen, 58.42% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 14m 43s. Estimated total time: 19h 7m 13s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 14s, 500 more iterations: 3h 11m 12s.
[2025-11-13 10:56:19,678][__main__][INFO] - Starting iteration 459.
[2025-11-13 10:56:19,681][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 45 and human policies 1.
[2025-11-13 10:56:19,682][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:56:28,481][__main__][INFO] - Number of regex retries in iteration 459: 0
[2025-11-13 10:56:28,482][__main__][INFO] - agents played in iteration 459 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:56:28,945][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:56:28,980][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:56:29,012][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:56:29,045][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:56:29,046][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:56:29,046][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:56:29,771][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:56:30,066][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:56:30,394][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:56:30,720][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:56:31,046][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:56:31,372][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:56:31,698][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:56:32,023][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:56:32,349][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:56:32,678][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:56:33,008][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:56:33,339][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:56:33,665][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:56:33,991][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:56:34,322][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:56:34,648][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:56:34,973][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:56:35,300][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:56:35,623][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:56:35,949][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:56:36,272][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:56:36,596][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:56:36,922][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:56:37,248][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:56:37,573][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:56:37,896][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:56:38,221][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:56:38,546][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:56:38,871][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:56:39,194][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:56:39,518][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:56:39,843][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:56:40,168][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:56:40,869][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:56:41,592][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:56:41,594][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:56:41,596][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:56:42,616][__main__][INFO] - Iteration 460 took 22s (38.37% Gen, 57.18% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 13m 51s. Estimated total time: 19h 6m 45s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 13s, 500 more iterations: 3h 11m 7s.
[2025-11-13 10:56:42,618][__main__][INFO] - Starting iteration 460.
[2025-11-13 10:56:42,621][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 45 and human policies 1.
[2025-11-13 10:56:42,621][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:56:51,706][__main__][INFO] - Number of regex retries in iteration 460: 0
[2025-11-13 10:56:51,707][__main__][INFO] - agents played in iteration 460 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:56:52,162][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:56:52,196][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:56:52,228][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:56:52,261][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:56:52,261][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:56:52,262][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:56:52,975][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:56:53,271][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:56:53,597][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:56:53,927][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:56:54,252][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:56:54,580][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:56:54,907][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:56:55,230][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:56:55,560][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:56:55,887][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:56:56,215][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:56:56,541][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:56:56,869][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:56:57,200][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:56:57,526][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:56:57,853][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:56:58,178][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:56:58,504][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:56:58,829][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:56:59,154][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:56:59,480][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:56:59,804][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:57:00,129][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:57:00,453][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:57:00,778][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:57:01,102][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:57:01,426][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:57:01,750][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:57:02,076][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:57:02,400][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:57:02,724][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:57:03,049][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:57:03,373][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:57:04,093][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:57:04,799][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:57:04,802][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:57:04,804][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:57:06,805][__main__][INFO] - Iteration 461 took 24s (37.57% Gen, 54.15% Train). Generation: 9s, Training: 13s. Estimated remaining time: 17h 15m 56s. Estimated total time: 20h 9m 14s. Time estimates for 10 more iterations: 4m 1s, 100 more iterations: 40m 18s, 500 more iterations: 3h 21m 32s.
[2025-11-13 10:57:06,807][__main__][INFO] - Starting iteration 461.
[2025-11-13 10:57:06,810][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 46 and human policies 1.
[2025-11-13 10:57:06,811][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:57:16,249][__main__][INFO] - Number of regex retries in iteration 461: 0
[2025-11-13 10:57:16,250][__main__][INFO] - agents played in iteration 461 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:57:16,724][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:57:16,759][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:57:16,792][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:57:16,825][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:57:16,826][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:57:16,826][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:57:17,575][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:57:17,870][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:57:18,197][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:57:18,527][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:57:18,853][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:57:19,181][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:57:19,508][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:57:19,835][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:57:20,165][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:57:20,491][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:57:20,817][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:57:21,145][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:57:21,473][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:57:21,801][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:57:22,127][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:57:22,454][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:57:22,779][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:57:23,108][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:57:23,433][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:57:23,757][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:57:24,081][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:57:24,404][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:57:24,729][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:57:25,052][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:57:25,376][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:57:25,701][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:57:26,027][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:57:26,353][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:57:26,679][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:57:27,003][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:57:27,328][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:57:27,652][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:57:27,977][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:57:28,702][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:57:29,431][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:57:29,433][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:57:29,434][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:57:30,454][__main__][INFO] - Iteration 462 took 23s (39.92% Gen, 55.77% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 48m 31s. Estimated total time: 19h 42m 12s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 24s, 500 more iterations: 3h 17m 2s. [2025-11-13 10:57:30,456][__main__][INFO] - Starting iteration 462. [2025-11-13 10:57:30,459][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 46 and human policies 1. 
[2025-11-13 10:57:30,459][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:57:39,326][__main__][INFO] - Number of regex retries in iteration 462: 0 [2025-11-13 10:57:39,327][__main__][INFO] - agents played in iteration 462 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 10:57:39,778][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:57:39,815][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:57:39,848][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:57:39,881][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:57:39,881][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:57:39,882][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:57:40,940][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:57:41,236][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:57:41,562][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:57:41,890][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:57:42,216][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:57:42,541][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:57:42,865][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:57:43,193][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:57:43,518][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:57:43,848][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:57:44,171][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:57:44,496][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:57:44,825][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:57:45,149][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:57:45,474][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:57:45,797][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:57:46,124][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:57:46,449][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:57:46,774][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:57:47,098][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 
[2025-11-13 10:57:47,423][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:57:47,749][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:57:48,074][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:57:48,399][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:57:48,722][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:57:49,047][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:57:49,371][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:57:49,695][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:57:50,019][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:57:50,343][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:57:50,668][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:57:50,991][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:57:51,316][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:57:52,044][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:57:52,775][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:57:52,777][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:57:52,778][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:57:53,751][__main__][INFO] - Iteration 463 took 23s (38.07% Gen, 57.75% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 30m 35s. Estimated total time: 19h 24m 40s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 49s, 500 more iterations: 3h 14m 6s. [2025-11-13 10:57:53,753][__main__][INFO] - Starting iteration 463. [2025-11-13 10:57:53,757][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 46 and human policies 1. 
[2025-11-13 10:57:53,757][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:58:02,657][__main__][INFO] - Number of regex retries in iteration 463: 0 [2025-11-13 10:58:02,657][__main__][INFO] - agents played in iteration 463 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 10:58:03,108][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:58:03,143][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:58:03,176][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:58:03,208][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:58:03,209][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:58:03,210][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:58:03,904][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:58:04,200][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:58:04,525][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:58:04,850][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:58:05,179][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:58:05,504][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:58:05,831][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:58:06,154][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:58:06,478][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:58:06,801][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:58:07,124][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:58:07,447][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:58:07,771][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:58:08,095][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:58:08,421][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:58:08,745][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:58:09,070][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:58:09,397][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:58:09,723][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:58:10,048][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 
[2025-11-13 10:58:10,378][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:58:10,703][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:58:11,028][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:58:11,354][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:58:11,678][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:58:12,003][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:58:12,328][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:58:12,653][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:58:12,978][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:58:13,303][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:58:13,628][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:58:13,953][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:58:14,277][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:58:15,003][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:58:15,698][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:58:15,701][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:58:15,703][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:58:16,677][__main__][INFO] - Iteration 464 took 22s (38.83% Gen, 56.92% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 11m 34s. Estimated total time: 19h 6m 2s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 12s, 500 more iterations: 3h 11m 0s. [2025-11-13 10:58:16,679][__main__][INFO] - Starting iteration 464. [2025-11-13 10:58:16,682][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 46 and human policies 1. 
[2025-11-13 10:58:16,683][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:58:25,783][__main__][INFO] - Number of regex retries in iteration 464: 0 [2025-11-13 10:58:25,784][__main__][INFO] - agents played in iteration 464 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 10:58:26,231][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:58:26,593][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:58:26,626][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:58:26,658][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:58:26,659][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:58:26,660][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:58:27,395][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:58:27,691][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:58:28,015][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:58:28,338][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:58:28,661][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:58:28,986][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:58:29,315][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:58:29,638][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:58:29,961][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:58:30,284][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:58:30,606][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:58:30,928][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:58:31,252][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:58:31,577][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:58:31,902][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:58:32,229][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:58:32,553][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:58:32,877][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:58:33,199][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:58:33,523][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 
[2025-11-13 10:58:33,848][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:58:34,174][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:58:34,498][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:58:34,824][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:58:35,148][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:58:35,472][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:58:35,797][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:58:36,122][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:58:36,448][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:58:36,773][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:58:37,097][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:58:37,422][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:58:37,746][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:58:38,481][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:58:39,202][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:58:39,203][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:58:39,205][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:58:40,209][__main__][INFO] - Iteration 465 took 23s (38.68% Gen, 57.04% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 41m 32s. Estimated total time: 19h 36m 23s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 12s, 500 more iterations: 3h 16m 3s. [2025-11-13 10:58:40,211][__main__][INFO] - Starting iteration 465. [2025-11-13 10:58:40,214][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 46 and human policies 1. 
[2025-11-13 10:58:40,215][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:58:49,715][__main__][INFO] - Number of regex retries in iteration 465: 0 [2025-11-13 10:58:49,716][__main__][INFO] - agents played in iteration 465 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 10:58:50,177][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:58:50,213][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:58:50,245][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:58:50,278][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:58:50,279][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:58:50,279][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:58:51,011][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:58:51,306][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:58:51,631][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:58:51,957][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:58:52,284][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:58:52,608][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:58:52,939][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:58:53,261][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:58:53,585][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:58:53,909][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:58:54,232][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:58:54,557][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:58:54,881][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:58:55,207][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:58:55,531][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:58:55,855][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:58:56,179][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:58:56,503][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:58:56,830][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:58:57,156][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 
[2025-11-13 10:58:57,481][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:58:57,806][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:58:58,130][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:58:58,454][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:58:58,778][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:58:59,102][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:58:59,427][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:58:59,752][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:59:00,076][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:59:00,401][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:59:00,726][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:59:01,051][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:59:01,376][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:59:02,106][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:59:02,828][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:59:02,830][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:59:02,832][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:59:03,805][__main__][INFO] - Iteration 466 took 23s (40.27% Gen, 55.60% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 44m 20s. Estimated total time: 19h 39m 34s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 19s, 500 more iterations: 3h 16m 35s. [2025-11-13 10:59:03,807][__main__][INFO] - Starting iteration 466. [2025-11-13 10:59:03,810][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 46 and human policies 1. 
[2025-11-13 10:59:03,811][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:59:13,082][__main__][INFO] - Number of regex retries in iteration 466: 0 [2025-11-13 10:59:13,082][__main__][INFO] - agents played in iteration 466 are Bob, Alice_buffer, Bob_buffer, Alice [2025-11-13 10:59:13,533][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:59:13,566][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:59:13,598][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:59:13,631][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:59:13,632][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:59:13,632][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:59:14,347][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 10:59:14,643][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 10:59:14,969][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 10:59:15,292][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 10:59:15,616][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 10:59:15,940][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 10:59:16,263][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 10:59:16,586][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 10:59:16,910][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 10:59:17,237][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 10:59:17,560][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 10:59:17,884][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 10:59:18,209][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 10:59:18,533][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 10:59:18,858][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 10:59:19,182][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 10:59:19,509][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 10:59:19,835][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 10:59:20,163][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 10:59:20,488][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 
[2025-11-13 10:59:20,812][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 10:59:21,138][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 10:59:21,462][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 10:59:21,788][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 10:59:22,112][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 10:59:22,436][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 10:59:22,761][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 10:59:23,085][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 10:59:23,410][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 10:59:23,734][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 10:59:24,063][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 10:59:24,386][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 10:59:24,711][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 10:59:25,438][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11 [2025-11-13 10:59:26,164][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:59:26,166][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:59:26,167][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:59:27,235][__main__][INFO] - Iteration 467 took 23s (39.58% Gen, 55.86% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 35m 38s. Estimated total time: 19h 31m 16s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 2s, 500 more iterations: 3h 15m 12s. [2025-11-13 10:59:27,237][__main__][INFO] - Starting iteration 467. [2025-11-13 10:59:27,241][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 46 and human policies 1. 
[2025-11-13 10:59:27,241][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 10:59:36,662][__main__][INFO] - Number of regex retries in iteration 467: 0
[2025-11-13 10:59:36,663][__main__][INFO] - agents played in iteration 467 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 10:59:37,127][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:59:37,160][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:59:37,194][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:59:37,228][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 10:59:37,228][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 10:59:37,229][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 10:59:37,970][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 10:59:38,265][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 10:59:38,591][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 10:59:38,916][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 10:59:39,241][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 10:59:39,564][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 10:59:39,889][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 10:59:40,213][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 10:59:40,538][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 10:59:40,863][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 10:59:41,188][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 10:59:41,513][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 10:59:41,838][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 10:59:42,163][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 10:59:42,490][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 10:59:42,817][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 10:59:43,145][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 10:59:43,469][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 10:59:43,794][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 10:59:44,119][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 10:59:44,443][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 10:59:44,769][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 10:59:45,094][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 10:59:45,417][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 10:59:45,741][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 10:59:46,066][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 10:59:46,391][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 10:59:46,716][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 10:59:47,039][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 10:59:47,365][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 10:59:47,689][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 10:59:48,014][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 10:59:48,340][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 10:59:49,065][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 10:59:49,799][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 10:59:49,801][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 10:59:49,803][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 10:59:51,082][__main__][INFO] - Iteration 468 took 23s (39.52% Gen, 55.11% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 56m 5s. Estimated total time: 19h 52m 7s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 44s, 500 more iterations: 3h 18m 41s.
[2025-11-13 10:59:51,084][__main__][INFO] - Starting iteration 468.
[2025-11-13 10:59:51,088][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 46 and human policies 1.
[2025-11-13 10:59:51,088][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:00:00,337][__main__][INFO] - Number of regex retries in iteration 468: 0
[2025-11-13 11:00:00,338][__main__][INFO] - agents played in iteration 468 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 11:00:00,814][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:00:00,847][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:00:00,879][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:00:00,912][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:00:00,913][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:00:00,914][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:00:01,639][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:00:01,935][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:00:02,259][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:00:02,587][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:00:02,914][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:00:03,239][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:00:03,563][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:00:03,887][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:00:04,213][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:00:04,537][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:00:04,862][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:00:05,190][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:00:05,519][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:00:05,850][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:00:06,180][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:00:06,504][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:00:06,828][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:00:07,154][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:00:07,479][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:00:07,803][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:00:08,129][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:00:08,454][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:00:08,778][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:00:09,102][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:00:09,428][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:00:09,753][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:00:10,077][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:00:10,401][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:00:10,726][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:00:11,051][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:00:11,375][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:00:11,698][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:00:12,023][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:00:12,754][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 11:00:13,497][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:00:13,498][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:00:13,500][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:00:14,591][__main__][INFO] - Iteration 469 took 23s (39.35% Gen, 56.00% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 38m 48s. Estimated total time: 19h 35m 13s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 10s, 500 more iterations: 3h 15m 52s.
[2025-11-13 11:00:14,594][__main__][INFO] - Starting iteration 469.
[2025-11-13 11:00:14,597][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 46 and human policies 1.
[2025-11-13 11:00:14,598][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:00:24,023][__main__][INFO] - Number of regex retries in iteration 469: 0
[2025-11-13 11:00:24,024][__main__][INFO] - agents played in iteration 469 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 11:00:24,482][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:00:24,516][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:00:24,549][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:00:24,582][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:00:24,583][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:00:24,583][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:00:25,301][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:00:25,597][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:00:25,922][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:00:26,245][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:00:26,568][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:00:26,893][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:00:27,218][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:00:27,545][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:00:27,872][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:00:28,197][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:00:28,521][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:00:28,851][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:00:29,178][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:00:29,505][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:00:29,832][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:00:30,156][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:00:30,481][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:00:30,806][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:00:31,131][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:00:31,456][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:00:31,781][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:00:32,108][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:00:32,432][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:00:32,757][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:00:33,083][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:00:33,407][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:00:33,732][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:00:34,056][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:00:34,379][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:00:34,705][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:00:35,030][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:00:35,355][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:00:35,680][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:00:36,398][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 11:00:37,149][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:00:37,151][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:00:37,153][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:00:38,295][__main__][INFO] - Iteration 470 took 23s (39.78% Gen, 55.40% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 48m 6s. Estimated total time: 19h 44m 55s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 29s, 500 more iterations: 3h 17m 29s.
[2025-11-13 11:00:38,297][__main__][INFO] - Starting iteration 470.
[2025-11-13 11:00:38,300][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 46 and human policies 1.
[2025-11-13 11:00:38,300][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:00:46,836][__main__][INFO] - Number of regex retries in iteration 470: 0
[2025-11-13 11:00:46,837][__main__][INFO] - agents played in iteration 470 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 11:00:47,300][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:00:47,333][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:00:47,367][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:00:47,401][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:00:47,402][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:00:47,402][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:00:48,121][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:00:48,416][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:00:48,743][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:00:49,066][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:00:49,391][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:00:49,715][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:00:50,037][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:00:50,361][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:00:50,687][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:00:51,012][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:00:51,339][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:00:51,668][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:00:51,994][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:00:52,321][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:00:52,645][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:00:52,969][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:00:53,296][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:00:53,627][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:00:53,952][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:00:54,277][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:00:54,602][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:00:54,927][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:00:55,250][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:00:55,575][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:00:55,899][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:00:56,223][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:00:56,548][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:00:56,873][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:00:57,197][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:00:57,522][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:00:57,846][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:00:58,170][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:00:58,495][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:00:59,226][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 11:00:59,963][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:00:59,966][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:00:59,968][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:01:02,139][__main__][INFO] - Iteration 471 took 23s (35.81% Gen, 55.08% Train). Generation: 8s, Training: 13s. Estimated remaining time: 16h 54m 46s. Estimated total time: 19h 51m 59s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 43s, 500 more iterations: 3h 18m 39s.
[2025-11-13 11:01:02,141][__main__][INFO] - Starting iteration 471.
[2025-11-13 11:01:02,143][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 47 and human policies 1.
[2025-11-13 11:01:02,144][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:01:11,311][__main__][INFO] - Number of regex retries in iteration 471: 0
[2025-11-13 11:01:11,312][__main__][INFO] - agents played in iteration 471 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 11:01:11,770][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:01:12,143][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:01:12,176][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:01:12,209][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:01:12,210][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:01:12,210][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:01:12,950][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:01:13,244][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:01:13,571][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:01:13,895][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:01:14,219][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:01:14,546][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:01:14,871][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:01:15,198][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:01:15,525][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:01:15,851][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:01:16,177][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:01:16,504][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:01:16,830][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:01:17,154][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:01:17,480][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:01:17,806][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:01:18,132][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:01:18,457][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:01:18,782][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:01:19,107][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:01:19,431][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:01:19,756][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:01:20,080][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:01:20,405][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:01:20,730][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:01:21,054][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:01:21,377][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:01:21,702][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:01:22,026][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:01:22,350][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:01:22,675][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:01:22,999][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:01:23,325][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:01:24,065][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 11:01:24,813][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:01:24,815][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:01:24,817][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:01:25,875][__main__][INFO] - Iteration 472 took 23s (38.63% Gen, 56.90% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 48m 59s. Estimated total time: 19h 46m 36s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 33s, 500 more iterations: 3h 17m 46s.
[2025-11-13 11:01:25,877][__main__][INFO] - Starting iteration 472.
[2025-11-13 11:01:25,880][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 47 and human policies 1.
[2025-11-13 11:01:25,881][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:01:34,936][__main__][INFO] - Number of regex retries in iteration 472: 0
[2025-11-13 11:01:34,937][__main__][INFO] - agents played in iteration 472 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 11:01:35,524][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:01:35,557][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:01:35,591][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:01:35,625][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:01:35,625][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:01:35,626][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:01:36,344][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:01:36,639][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:01:36,964][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:01:37,288][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:01:37,614][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:01:37,940][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:01:38,265][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:01:38,590][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:01:38,917][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:01:39,241][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:01:39,564][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:01:39,889][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:01:40,218][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:01:40,542][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:01:40,868][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:01:41,193][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:01:41,517][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:01:41,843][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:01:42,168][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:01:42,492][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:01:42,816][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:01:43,142][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:01:43,468][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:01:43,793][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:01:44,116][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:01:44,441][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:01:44,765][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:01:45,089][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:01:45,415][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:01:45,741][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:01:46,065][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:01:46,388][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:01:46,713][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:01:47,435][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 11:01:48,173][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:01:48,175][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:01:48,177][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:01:49,194][__main__][INFO] - Iteration 473 took 23s (38.84% Gen, 56.78% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 27m 44s. Estimated total time: 19h 25m 44s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 51s, 500 more iterations: 3h 14m 17s.
[2025-11-13 11:01:49,196][__main__][INFO] - Starting iteration 473.
[2025-11-13 11:01:49,199][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 47 and human policies 1.
[2025-11-13 11:01:49,200][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:01:58,284][__main__][INFO] - Number of regex retries in iteration 473: 0
[2025-11-13 11:01:58,285][__main__][INFO] - agents played in iteration 473 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 11:01:58,750][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:01:58,783][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:01:58,817][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:01:58,851][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:01:58,851][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:01:58,852][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:01:59,564][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:01:59,859][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:02:00,183][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:02:00,513][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:02:00,839][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:02:01,169][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:02:01,499][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:02:01,823][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:02:02,148][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:02:02,475][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:02:02,804][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:02:03,131][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:02:03,458][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:02:03,782][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:02:04,109][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:02:04,434][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:02:04,758][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:02:05,085][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:02:05,409][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:02:05,735][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:02:06,058][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:02:06,383][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:02:06,708][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:02:07,033][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:02:07,357][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:02:07,681][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:02:08,007][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:02:08,333][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:02:08,657][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:02:08,981][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:02:09,306][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:02:09,632][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:02:09,956][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:02:10,674][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 11:02:11,412][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:02:11,413][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:02:11,415][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:02:12,430][__main__][INFO] - Iteration 474 took 23s (39.10% Gen, 56.52% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 23m 9s. Estimated total time: 19h 21m 32s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 43s, 500 more iterations: 3h 13m 35s.
[2025-11-13 11:02:12,432][__main__][INFO] - Starting iteration 474.
[2025-11-13 11:02:12,435][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 47 and human policies 1.
[2025-11-13 11:02:12,435][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:02:21,653][__main__][INFO] - Number of regex retries in iteration 474: 0
[2025-11-13 11:02:21,654][__main__][INFO] - agents played in iteration 474 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 11:02:22,125][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:02:22,159][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:02:22,192][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:02:22,225][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:02:22,225][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:02:22,226][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:02:22,957][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:02:23,253][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:02:23,579][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:02:23,903][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:02:24,228][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:02:24,554][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:02:24,883][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:02:25,211][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:02:25,537][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:02:25,867][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:02:26,194][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:02:26,524][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:02:26,853][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:02:27,180][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:02:27,508][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:02:27,834][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:02:28,160][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:02:28,486][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:02:28,811][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:02:29,134][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:02:29,458][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:02:29,782][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:02:30,107][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:02:30,431][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:02:30,755][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:02:31,079][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:02:31,405][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:02:31,728][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:02:32,051][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:02:32,375][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:02:32,699][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:02:33,025][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:02:33,348][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:02:34,063][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 11:02:34,800][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:02:34,801][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:02:34,803][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:02:35,809][__main__][INFO] - Iteration 475 took 23s (39.44% Gen, 56.25% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 29m 58s. Estimated total time: 19h 28m 45s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 57s, 500 more iterations: 3h 14m 47s.
[2025-11-13 11:02:35,811][__main__][INFO] - Starting iteration 475.
[2025-11-13 11:02:35,815][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 47 and human policies 1.
[2025-11-13 11:02:35,815][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:02:45,540][__main__][INFO] - Number of regex retries in iteration 475: 0
[2025-11-13 11:02:45,541][__main__][INFO] - agents played in iteration 475 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 11:02:46,003][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:02:46,037][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:02:46,070][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:02:46,104][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:02:46,105][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:02:46,105][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:02:46,826][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:02:47,122][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:02:47,446][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:02:47,770][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:02:48,093][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:02:48,418][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:02:48,743][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:02:49,067][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:02:49,391][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:02:49,715][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:02:50,043][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:02:50,371][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:02:50,698][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:02:51,022][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:02:51,346][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:02:51,670][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:02:51,997][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:02:52,322][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:02:52,646][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:02:52,971][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:02:53,295][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:02:53,621][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:02:53,945][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:02:54,269][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:02:54,595][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:02:54,920][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:02:55,244][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:02:55,569][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:02:55,893][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:02:56,217][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:02:56,542][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:02:56,866][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:02:57,191][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:02:57,923][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 11:02:58,800][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:02:58,801][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:02:58,803][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:02:59,785][__main__][INFO] - Iteration 476 took 23s (40.57% Gen, 55.33% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 59m 22s. Estimated total time: 19h 58m 33s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 57s, 500 more iterations: 3h 19m 45s.
[2025-11-13 11:02:59,787][__main__][INFO] - Starting iteration 476.
[2025-11-13 11:02:59,790][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 47 and human policies 1.
[2025-11-13 11:02:59,791][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:03:09,578][__main__][INFO] - Number of regex retries in iteration 476: 0
[2025-11-13 11:03:09,579][__main__][INFO] - agents played in iteration 476 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 11:03:10,059][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:03:10,092][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:03:10,125][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:03:10,158][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:03:10,158][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:03:10,159][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:03:10,860][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:03:11,151][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:03:11,476][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:03:11,802][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:03:12,127][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:03:12,452][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:03:12,777][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:03:13,108][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:03:13,435][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:03:13,764][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:03:14,095][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:03:14,421][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:03:14,749][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:03:15,072][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:03:15,396][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:03:15,719][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:03:16,043][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:03:16,369][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:03:16,695][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:03:17,020][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:03:17,346][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:03:17,671][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:03:17,996][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:03:18,319][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:03:18,644][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:03:18,968][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:03:19,293][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:03:19,620][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:03:19,947][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:03:20,271][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:03:20,596][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:03:20,923][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:03:21,249][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:03:21,981][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 11:03:22,707][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:03:22,708][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:03:22,710][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:03:23,701][__main__][INFO] - Iteration 477 took 23s (40.93% Gen, 54.92% Train). Generation: 9s, Training: 13s. Estimated remaining time: 16h 56m 0s. Estimated total time: 19h 55m 34s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 51s, 500 more iterations: 3h 19m 15s.
[2025-11-13 11:03:23,703][__main__][INFO] - Starting iteration 477.
[2025-11-13 11:03:23,706][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 47 and human policies 1.
[2025-11-13 11:03:23,707][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:03:33,037][__main__][INFO] - Number of regex retries in iteration 477: 0
[2025-11-13 11:03:33,038][__main__][INFO] - agents played in iteration 477 are Bob, Alice_buffer, Bob_buffer, Alice
[2025-11-13 11:03:33,495][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:03:33,528][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:03:33,561][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:03:33,595][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.78%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:03:33,595][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:03:33,595][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:03:34,307][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:03:34,602][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:03:34,926][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:03:35,249][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:03:35,579][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:03:35,909][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:03:36,237][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:03:36,566][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:03:36,894][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:03:37,224][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:03:37,554][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:03:37,882][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:03:38,209][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:03:38,534][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:03:38,858][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:03:39,183][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:03:39,510][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:03:39,835][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:03:40,161][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:03:40,486][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:03:40,810][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:03:41,135][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:03:41,459][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:03:41,785][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:03:42,109][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:03:42,433][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:03:42,757][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:03:43,082][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:03:43,407][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:03:43,731][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:03:44,057][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:03:44,380][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:03:44,705][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:03:45,455][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 42.04%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 11:03:46,177][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:03:46,178][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:03:46,180][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:06:51,636][mllm.models.large_language_model_local][INFO] - Loaded 47 past agent adapters from checkpoints directory.
[2025-11-13 11:07:10,514][mllm.models.large_language_model_local][INFO] - Initializing adapter 'agent_adapter': using existing weights from output folder '/scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/Qwen/Qwen2.5-7B-Instruct/adapters/agent_adapter'.
[2025-11-13 11:07:11,599][mllm.models.adapter_training_wrapper][WARNING] - Adapter 'agent_adapter': failed to load from '/scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/Qwen/Qwen2.5-7B-Instruct/adapters/agent_adapter': Error while deserializing header: MetadataIncompleteBuffer
[2025-11-13 11:07:11,599][mllm.models.adapter_training_wrapper][INFO] - Adapter 'agent_adapter': initialized with fresh weights (no initial weights found).
[2025-11-13 11:07:11,608][mllm.models.large_language_model_local][INFO] - Initializing adapter 'critic_adapter': using existing weights from output folder '/scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/Qwen/Qwen2.5-7B-Instruct/adapters/critic_adapter'. [2025-11-13 11:07:12,909][mllm.models.adapter_training_wrapper][INFO] - Adapter 'critic_adapter': loaded initial weights from '/scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/Qwen/Qwen2.5-7B-Instruct/adapters/critic_adapter'. [2025-11-13 11:09:23,599][mllm.training.trainer_common][INFO] - Loading trainer state from /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:09:23,602][mllm.training.trainer_common][INFO] - Loading policy optimizer state from /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:09:24,322][mllm.training.trainer_common][INFO] - Loading critic optimizer state from /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:09:24,325][__main__][INFO] - Starting iteration 477. [2025-11-13 11:09:24,329][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 47 and human policies 1. 
[2025-11-13 11:09:24,330][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:09:56,854][__main__][INFO] - Number of regex retries in iteration 477: 0 [2025-11-13 11:09:56,854][__main__][INFO] - agents played in iteration 477 are Alice, Bob, Alice_buffer, Bob_buffer [2025-11-13 11:09:57,277][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 39.04%, Block Peak % of device VRAM: 19.44%, ΔTime: 00:00:00 [2025-11-13 11:09:57,320][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 39.04%, Block Peak % of device VRAM: 19.44%, ΔTime: 00:00:00 [2025-11-13 11:09:57,359][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 39.04%, Block Peak % of device VRAM: 19.44%, ΔTime: 00:00:00 [2025-11-13 11:09:57,397][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 39.04%, Block Peak % of device VRAM: 19.44%, ΔTime: 00:00:00 [2025-11-13 11:09:57,397][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:09:57,397][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:09:57,968][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:09:58,617][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:09:58,943][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:09:59,265][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:09:59,589][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:09:59,913][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:10:00,236][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:10:00,559][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:10:00,881][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:10:01,204][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:10:01,535][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:10:01,861][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:10:02,191][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:10:02,513][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:10:02,839][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:10:03,169][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:10:03,495][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:10:03,832][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:10:04,154][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:10:04,481][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:10:04,812][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:10:05,140][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:10:05,464][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:10:05,790][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:10:06,117][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:10:06,442][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:10:06,770][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:10:07,098][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:10:07,434][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:10:07,758][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:10:08,084][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:10:08,409][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:10:08,741][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:10:09,414][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.78%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.98%, ΔTime: 00:00:11
[2025-11-13 11:10:10,198][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:10:10,201][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:10:10,203][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:10:11,272][__main__][INFO] - Iteration 478 took 46s (69.28% Gen, 28.44% Train). Generation: 32s, Training: 13s. Estimated remaining time: 39h 3m 51s. Estimated total time: 39h 7m 12s. Time estimates for 10 more iterations: 7m 49s, 100 more iterations: 1h 18m 14s, 500 more iterations: 6h 31m 12s.
[2025-11-13 11:10:11,275][__main__][INFO] - Starting iteration 478.
[2025-11-13 11:10:11,279][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 47 and human policies 1.
[2025-11-13 11:10:11,281][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:10:21,730][mllm.models.large_language_model_local][WARNING] - Response did not match regex: (|), retry 1/1
[2025-11-13 11:10:26,091][__main__][INFO] - Number of regex retries in iteration 478: 1
[2025-11-13 11:10:26,092][__main__][INFO] - agents played in iteration 478 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:10:26,538][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:10:26,575][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:10:26,609][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:10:26,641][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:10:26,642][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:10:26,643][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:10:27,308][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:10:27,602][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:10:27,923][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:10:28,246][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:10:28,568][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:10:28,892][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:10:29,226][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:10:29,549][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:10:29,872][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:10:30,194][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:10:30,515][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:10:30,836][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:10:31,157][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:10:31,482][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:10:31,804][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:10:32,124][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:10:32,448][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:10:32,771][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:10:33,091][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:10:33,415][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:10:33,738][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:10:34,062][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:10:34,379][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:10:34,701][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:10:35,024][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:10:35,356][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:10:35,673][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:10:35,996][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:10:36,319][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:10:36,655][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:10:36,974][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:10:37,296][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:10:37,621][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:10:38,304][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:10:39,027][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:10:39,030][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:10:39,031][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:10:39,949][__main__][INFO] - Iteration 479 took 28s (51.66% Gen, 45.13% Train). Generation: 14s, Training: 12s. Estimated remaining time: 23h 49m 42s. Estimated total time: 23h 53m 32s. Time estimates for 10 more iterations: 4m 46s, 100 more iterations: 47m 47s, 500 more iterations: 3h 58m 55s.
[2025-11-13 11:10:39,951][__main__][INFO] - Starting iteration 479.
[2025-11-13 11:10:39,954][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 47 and human policies 1.
[2025-11-13 11:10:39,954][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:10:53,066][__main__][INFO] - Number of regex retries in iteration 479: 0
[2025-11-13 11:10:53,067][__main__][INFO] - agents played in iteration 479 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:10:53,508][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:10:53,545][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:10:53,578][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:10:53,611][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:10:53,612][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:10:53,612][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:10:54,298][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:10:54,592][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:10:54,915][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:10:55,242][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:10:55,559][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:10:55,881][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:10:56,203][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:10:56,526][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:10:56,853][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:10:57,177][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:10:57,499][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:10:57,820][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:10:58,145][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:10:58,470][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:10:58,791][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:10:59,113][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:10:59,441][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:10:59,763][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:11:00,085][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:11:00,407][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:11:00,730][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:11:01,056][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:11:01,382][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:11:01,703][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:11:02,026][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:11:02,350][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:11:02,673][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:11:02,998][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:11:03,325][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:11:03,650][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:11:03,976][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:11:04,300][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:11:04,624][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:11:05,298][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:11:06,003][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:11:06,005][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:11:06,006][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:11:06,982][__main__][INFO] - Iteration 480 took 27s (48.51% Gen, 47.88% Train). Generation: 13s, Training: 12s. Estimated remaining time: 22h 27m 9s. Estimated total time: 22h 31m 26s. Time estimates for 10 more iterations: 4m 30s, 100 more iterations: 45m 2s, 500 more iterations: 3h 45m 14s.
[2025-11-13 11:11:06,984][__main__][INFO] - Starting iteration 480.
[2025-11-13 11:11:06,987][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 47 and human policies 1.
[2025-11-13 11:11:06,987][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:11:19,256][__main__][INFO] - Number of regex retries in iteration 480: 0
[2025-11-13 11:11:19,257][__main__][INFO] - agents played in iteration 480 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:11:19,690][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:11:19,723][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:11:19,756][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:11:19,789][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:11:19,789][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:11:19,790][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:11:20,463][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:11:20,759][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:11:21,080][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:11:21,401][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:11:21,723][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:11:22,047][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:11:22,370][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:11:22,692][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:11:23,013][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:11:23,335][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:11:23,658][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:11:23,979][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:11:24,301][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:11:24,624][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:11:24,946][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:11:25,268][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:11:25,590][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:11:25,913][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:11:26,234][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:11:26,556][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:11:26,878][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:11:27,199][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:11:27,522][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:11:27,849][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:11:28,171][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:11:28,493][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:11:28,813][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:11:29,137][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:11:29,463][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:11:29,786][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:11:30,109][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:11:30,437][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:11:30,759][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:11:31,441][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:10
[2025-11-13 11:11:32,142][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:11:32,143][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:11:32,145][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:11:33,723][__main__][INFO] - Iteration 481 took 26s (45.89% Gen, 48.20% Train). Generation: 12s, Training: 12s. Estimated remaining time: 22h 12m 7s. Estimated total time: 22h 16m 51s. Time estimates for 10 more iterations: 4m 27s, 100 more iterations: 44m 33s, 500 more iterations: 3h 42m 48s.
[2025-11-13 11:11:33,725][__main__][INFO] - Starting iteration 481.
[2025-11-13 11:11:33,728][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 48 and human policies 1.
[2025-11-13 11:11:33,728][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:11:44,995][__main__][INFO] - Number of regex retries in iteration 481: 0
[2025-11-13 11:11:44,996][__main__][INFO] - agents played in iteration 481 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:11:45,440][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:11:45,477][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:11:45,510][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:11:45,551][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:11:45,552][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:11:45,552][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:11:46,251][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:11:46,546][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:11:46,873][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:11:47,189][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:11:47,510][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:11:47,832][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:11:48,153][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:11:48,475][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:11:48,800][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:11:49,127][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:11:49,448][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:11:49,770][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:11:50,091][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:11:50,414][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:11:50,734][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:11:51,059][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:11:51,380][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:11:51,703][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:11:52,024][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:11:52,349][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:11:52,676][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:11:52,999][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:11:53,323][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:11:53,646][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:11:53,970][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:11:54,293][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:11:54,616][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:11:54,938][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:11:55,260][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:11:55,584][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:11:55,905][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:11:56,228][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:11:56,556][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:11:57,262][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:11:57,965][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:11:57,967][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:11:57,968][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:11:58,771][__main__][INFO] - Iteration 482 took 25s (44.99% Gen, 51.80% Train). Generation: 11s, Training: 12s. Estimated remaining time: 20h 47m 2s. Estimated total time: 20h 52m 11s. Time estimates for 10 more iterations: 4m 10s, 100 more iterations: 41m 44s, 500 more iterations: 3h 28m 41s.
[2025-11-13 11:11:58,773][__main__][INFO] - Starting iteration 482.
[2025-11-13 11:11:58,777][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 48 and human policies 1.
[2025-11-13 11:11:58,778][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:12:10,183][__main__][INFO] - Number of regex retries in iteration 482: 0
[2025-11-13 11:12:10,184][__main__][INFO] - agents played in iteration 482 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:12:10,615][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:12:10,650][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:12:10,687][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:12:10,720][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:12:10,721][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:12:10,721][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:12:11,380][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:12:11,674][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:12:11,997][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:12:12,325][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:12:12,646][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:12:12,969][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:12:13,293][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:12:13,627][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:12:13,951][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:12:14,276][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:12:14,598][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:12:14,926][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:12:15,250][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:12:15,575][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:12:15,897][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:12:16,230][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:12:16,559][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:12:16,883][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:12:17,204][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:12:17,538][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:12:17,859][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:12:18,180][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:12:18,505][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:12:18,828][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:12:19,155][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:12:19,477][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:12:19,800][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:12:20,126][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:12:20,452][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:12:20,775][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:12:21,097][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:12:21,420][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:12:21,743][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
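The cadence above (progress logged only at mini-batches 0, 4, 8, …, 124, then a single token total) suggests a gradient-accumulation loop over 128 mini-batches. A minimal sketch of that pattern, with invented loss values and an even 30 tokens per mini-batch (3840 / 128; the real split is not visible in the log):

```python
# Sketch of the accumulation loop implied by the "Processing mini-batch k of 128"
# lines: every mini-batch contributes to the accumulated loss, but progress is
# only logged every 4th batch. Losses and token counts are made up.

def accumulate(minibatches, log_every=4, log=print):
    """minibatches: list of (loss, n_tokens); returns token-weighted mean loss."""
    total_tokens = 0
    weighted_loss = 0.0
    for i, (loss, n_tokens) in enumerate(minibatches):
        if i % log_every == 0:
            log(f"Processing mini-batch {i} of {len(minibatches)}")
        weighted_loss += loss * n_tokens   # accumulate, normalize at the end
        total_tokens += n_tokens
    log(f"Accumulated the policy gradient loss for {total_tokens} tokens.")
    return weighted_loss / total_tokens

# 128 mini-batches of 30 tokens each reproduces the 3840-token total above.
avg_loss = accumulate([(0.5, 30)] * 128)
```

Normalizing by the token count (rather than the mini-batch count) keeps the gradient scale independent of how the batch is split.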
[2025-11-13 11:12:22,430][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:12:23,124][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:12:23,126][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:12:23,128][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:12:23,928][__main__][INFO] - Iteration 483 took 25s (45.34% Gen, 51.46% Train). Generation: 11s, Training: 12s. Estimated remaining time: 20h 52m 1s. Estimated total time: 20h 57m 35s. Time estimates for 10 more iterations: 4m 11s, 100 more iterations: 41m 55s, 500 more iterations: 3h 29m 35s.
[2025-11-13 11:12:23,930][__main__][INFO] - Starting iteration 483.
[2025-11-13 11:12:23,932][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 48 and human policies 1.
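The per-iteration summary lines extrapolate the 10/100/500-iteration estimates from the measured iteration duration. A sketch of that arithmetic (whether the trainer uses the last iteration or a running average is an assumption; a duration of about 25.042 s reproduces the "4m 10s / 41m 44s / 3h 28m 41s" row logged for iteration 482):

```python
# Sketch of the time-estimate arithmetic in the iteration summary lines.
# The averaging scheme is an assumption; here we extrapolate from a single
# per-iteration duration in seconds.

def fmt(seconds):
    """Format a duration the way the log does, e.g. 12521 -> '3h 28m 41s'."""
    seconds = round(seconds)
    h, rem = divmod(seconds, 3600)
    m, s = divmod(rem, 60)
    return f"{h}h {m}m {s}s" if h else f"{m}m {s}s"

def estimates(iter_seconds, horizons=(10, 100, 500)):
    """Projected wall-clock time for N more iterations at the current rate."""
    return {n: fmt(n * iter_seconds) for n in horizons}

print(estimates(25.042))
# → {10: '4m 10s', 100: '41m 44s', 500: '3h 28m 41s'}
```

The "Estimated remaining time" figure would additionally need the number of iterations left in the run, which the log itself does not show.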
[2025-11-13 11:12:23,933][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:12:34,328][__main__][INFO] - Number of regex retries in iteration 483: 0
[2025-11-13 11:12:34,329][__main__][INFO] - agents played in iteration 483 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:12:34,773][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:12:34,806][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:12:34,840][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:12:34,873][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:12:34,874][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:12:34,874][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:12:35,520][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:12:35,814][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:12:36,137][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:12:36,461][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:12:36,784][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:12:37,107][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:12:37,430][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:12:37,758][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:12:38,079][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:12:38,400][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:12:38,721][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:12:39,045][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:12:39,366][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:12:39,687][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:12:40,010][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:12:40,332][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:12:40,654][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:12:40,976][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:12:41,298][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:12:41,621][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:12:41,944][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:12:42,268][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:12:42,595][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:12:42,919][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:12:43,245][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:12:43,569][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:12:43,893][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:12:44,217][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:12:44,539][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:12:44,864][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:12:45,186][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:12:45,509][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:12:45,832][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:12:46,534][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:12:47,209][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:12:47,211][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:12:47,212][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:12:48,028][__main__][INFO] - Iteration 484 took 24s (43.14% Gen, 53.47% Train). Generation: 10s, Training: 12s. Estimated remaining time: 19h 58m 51s. Estimated total time: 20h 4m 50s. Time estimates for 10 more iterations: 4m 0s, 100 more iterations: 40m 9s, 500 more iterations: 3h 20m 48s.
[2025-11-13 11:12:48,030][__main__][INFO] - Starting iteration 484.
[2025-11-13 11:12:48,033][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 48 and human policies 1.
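Each step ends by persisting the two optimizer states (as `.pt` files) and a small pickled trainer state (`trainer_annealing_state.pkl`), so an interrupted run can resume its annealing schedule. A minimal sketch of that save/restore pattern; the actual fields of the trainer state are not visible in the log, so "iteration" and "lr_scale" below are invented placeholders:

```python
import pickle
import tempfile
from pathlib import Path

# Sketch of the trainer_annealing_state.pkl save/restore pattern.
# Field names are hypothetical; the log only shows that a pickle is written.

def save_trainer_state(state: dict, path: Path) -> None:
    with open(path, "wb") as f:
        pickle.dump(state, f)

def load_trainer_state(path: Path) -> dict:
    with open(path, "rb") as f:
        return pickle.load(f)

# Round-trip through a temporary file, as a resumed run would do at startup.
with tempfile.TemporaryDirectory() as d:
    state_path = Path(d) / "trainer_annealing_state.pkl"
    save_trainer_state({"iteration": 484, "lr_scale": 0.9}, state_path)
    restored = load_trainer_state(state_path)
```

The optimizer `.pt` files would instead be written with `torch.save(optimizer.state_dict(), path)`, which handles tensor state.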
[2025-11-13 11:12:48,034][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:12:58,399][__main__][INFO] - Number of regex retries in iteration 484: 0
[2025-11-13 11:12:58,400][__main__][INFO] - agents played in iteration 484 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:12:58,842][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:12:58,875][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:12:58,909][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:12:58,943][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:12:58,944][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:12:58,944][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:12:59,594][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:12:59,889][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:13:00,212][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:13:00,535][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:13:00,862][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:13:01,183][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:13:01,507][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:13:01,832][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:13:02,159][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:13:02,481][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:13:02,804][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:13:03,125][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:13:03,447][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:13:03,771][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:13:04,092][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:13:04,415][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:13:04,738][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:13:05,061][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:13:05,383][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:13:05,707][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:13:06,031][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:13:06,356][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:13:06,684][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:13:07,007][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:13:07,331][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:13:07,656][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:13:07,979][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:13:08,304][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:13:08,631][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:13:08,954][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:13:09,278][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:13:09,601][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:13:09,925][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:13:10,607][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:13:11,317][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:13:11,318][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:13:11,319][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:13:12,159][__main__][INFO] - Iteration 485 took 24s (42.96% Gen, 53.55% Train). Generation: 10s, Training: 12s. Estimated remaining time: 20h 0m 0s. Estimated total time: 20h 6m 22s. Time estimates for 10 more iterations: 4m 1s, 100 more iterations: 40m 12s, 500 more iterations: 3h 21m 3s.
[2025-11-13 11:13:12,161][__main__][INFO] - Starting iteration 485.
[2025-11-13 11:13:12,164][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 48 and human policies 1.
[2025-11-13 11:13:12,165][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:13:22,304][__main__][INFO] - Number of regex retries in iteration 485: 0
[2025-11-13 11:13:22,304][__main__][INFO] - agents played in iteration 485 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:13:22,733][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:13:22,767][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:13:22,800][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:13:22,833][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:13:22,833][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:13:22,834][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:13:23,476][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:13:23,770][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:13:24,093][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:13:24,415][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:13:24,739][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:13:25,067][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:13:25,384][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:13:25,706][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:13:26,029][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:13:26,350][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:13:26,672][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:13:26,994][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:13:27,317][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:13:27,638][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:13:27,967][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:13:28,282][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:13:28,605][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:13:28,928][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:13:29,255][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:13:29,575][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:13:29,897][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:13:30,223][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:13:30,547][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:13:30,872][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:13:31,200][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:13:31,523][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:13:31,846][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:13:32,170][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:13:32,494][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:13:32,817][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:13:33,141][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:13:33,465][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:13:33,788][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:13:34,487][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:13:35,172][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:13:35,174][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:13:35,175][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:13:35,984][__main__][INFO] - Iteration 486 took 23s (42.57% Gen, 54.03% Train). Generation: 10s, Training: 12s. Estimated remaining time: 19h 44m 15s. Estimated total time: 19h 51m 2s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 42s, 500 more iterations: 3h 18m 30s.
[2025-11-13 11:13:35,986][__main__][INFO] - Starting iteration 486.
[2025-11-13 11:13:35,989][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 48 and human policies 1.
[2025-11-13 11:13:35,989][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:13:45,727][__main__][INFO] - Number of regex retries in iteration 486: 0
[2025-11-13 11:13:45,728][__main__][INFO] - agents played in iteration 486 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:13:46,272][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:13:46,308][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:13:46,341][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:13:46,390][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:13:46,390][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:13:46,391][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:13:47,043][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:13:47,337][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:13:47,665][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:13:47,980][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:13:48,303][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:13:48,625][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:13:48,947][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:13:49,269][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:13:49,592][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:13:49,914][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:13:50,238][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:13:50,562][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:13:50,888][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:13:51,212][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:13:51,539][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:13:51,862][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:13:52,184][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:13:52,513][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:13:52,841][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:13:53,166][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:13:53,489][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:13:53,814][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:13:54,138][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:13:54,462][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:13:54,785][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:13:55,108][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:13:55,431][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:13:55,758][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:13:56,081][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:13:56,404][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:13:56,727][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:13:57,053][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:13:57,376][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:13:58,066][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:13:58,760][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:13:58,761][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:13:58,763][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:13:59,574][__main__][INFO] - Iteration 487 took 23s (41.29% Gen, 55.27% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 32m 7s. Estimated total time: 19h 39m 17s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 18s, 500 more iterations: 3h 16m 32s.
[2025-11-13 11:13:59,576][__main__][INFO] - Starting iteration 487.
[2025-11-13 11:13:59,579][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 48 and human policies 1.
[2025-11-13 11:13:59,579][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:14:08,644][__main__][INFO] - Number of regex retries in iteration 487: 0
[2025-11-13 11:14:08,645][__main__][INFO] - agents played in iteration 487 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:14:09,077][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:14:09,110][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:14:09,142][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:14:09,174][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:14:09,175][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:14:09,175][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:14:09,854][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:14:10,148][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:14:10,472][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:14:10,793][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:14:11,115][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:14:11,438][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:14:11,759][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:14:12,082][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:14:12,404][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:14:12,726][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:14:13,052][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:14:13,375][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:14:13,698][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:14:14,020][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:14:14,344][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:14:14,667][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:14:14,990][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:14:15,314][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:14:15,637][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:14:15,965][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:14:16,295][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:14:16,621][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:14:16,950][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:14:17,277][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:14:17,601][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:14:17,925][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:14:18,249][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:14:18,572][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:14:18,901][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:14:19,224][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:14:19,559][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:14:19,882][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:14:20,205][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:14:20,918][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:14:21,596][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:14:21,598][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:14:21,599][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:14:22,422][__main__][INFO] - Iteration 488 took 22s (39.69% Gen, 56.71% Train). Generation: 9s, Training: 12s. Estimated remaining time: 18h 54m 41s. Estimated total time: 19h 2m 14s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 4s, 500 more iterations: 3h 10m 22s.
[2025-11-13 11:14:22,424][__main__][INFO] - Starting iteration 488.
[2025-11-13 11:14:22,427][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 48 and human policies 1.
[2025-11-13 11:14:22,428][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:14:33,010][__main__][INFO] - Number of regex retries in iteration 488: 0
[2025-11-13 11:14:33,011][__main__][INFO] - agents played in iteration 488 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:14:33,438][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:14:33,472][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:14:33,505][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:14:33,538][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:14:33,538][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:14:33,539][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:14:34,198][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:14:34,490][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:14:34,813][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:14:35,137][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:14:35,461][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:14:35,783][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:14:36,107][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:14:36,431][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:14:36,752][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:14:37,074][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:14:37,398][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:14:37,720][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:14:38,052][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:14:38,375][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:14:38,700][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:14:39,027][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:14:39,358][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:14:39,687][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:14:40,013][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:14:40,337][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:14:40,671][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:14:40,999][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:14:41,323][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:14:41,646][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:14:41,974][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:14:42,297][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:14:42,621][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:14:42,946][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:14:43,273][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:14:43,597][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:14:43,921][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:14:44,246][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:14:44,577][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:14:45,270][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:14:45,960][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:14:45,961][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:14:45,963][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:14:46,776][__main__][INFO] - Iteration 489 took 24s (43.46% Gen, 53.19% Train). Generation: 10s, Training: 12s. Estimated remaining time: 20h 9m 31s. Estimated total time: 20h 17m 28s. Time estimates for 10 more iterations: 4m 3s, 100 more iterations: 40m 34s, 500 more iterations: 3h 22m 54s.
[2025-11-13 11:14:46,778][__main__][INFO] - Starting iteration 489.
[2025-11-13 11:14:46,781][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 48 and human policies 1.
[2025-11-13 11:14:46,781][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:14:55,743][__main__][INFO] - Number of regex retries in iteration 489: 0
[2025-11-13 11:14:55,743][__main__][INFO] - agents played in iteration 489 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:14:56,160][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:14:56,195][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:14:56,228][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:14:56,261][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:14:56,262][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:14:56,262][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:14:56,952][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:14:57,246][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:14:57,570][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:14:57,892][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:14:58,220][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:14:58,548][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:14:58,873][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:14:59,198][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:14:59,522][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:14:59,844][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:15:00,170][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:15:00,495][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:15:00,820][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:15:01,145][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:15:01,471][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:15:01,797][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:15:02,123][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:15:02,447][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:15:02,771][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:15:03,095][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:15:03,419][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:15:03,742][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:15:04,066][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:15:04,391][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:15:04,715][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:15:05,039][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:15:05,363][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:15:05,689][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:15:06,012][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:15:06,336][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:15:06,659][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:15:06,986][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:15:07,309][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:15:08,024][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:15:08,726][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:15:08,728][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:15:08,729][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:15:09,559][__main__][INFO] - Iteration 490 took 22s (39.34% Gen, 57.01% Train). Generation: 8s, Training: 12s. Estimated remaining time: 18h 50m 37s. Estimated total time: 18h 58m 57s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 57s, 500 more iterations: 3h 9m 49s.
[2025-11-13 11:15:09,561][__main__][INFO] - Starting iteration 490.
[2025-11-13 11:15:09,563][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 48 and human policies 1.
[2025-11-13 11:15:09,564][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:15:19,589][__main__][INFO] - Number of regex retries in iteration 490: 0
[2025-11-13 11:15:19,589][__main__][INFO] - agents played in iteration 490 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:15:20,024][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:15:20,057][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:15:20,090][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:15:20,124][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:15:20,124][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:15:20,125][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:15:20,790][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:15:21,085][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:15:21,410][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:15:21,732][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:15:22,056][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:15:22,381][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:15:22,706][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:15:23,028][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:15:23,356][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:15:23,678][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:15:24,001][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:15:24,323][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:15:24,648][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:15:24,974][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:15:25,300][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:15:25,624][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:15:25,947][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:15:26,273][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:15:26,602][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:15:26,925][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:15:27,250][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:15:27,577][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:15:27,900][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:15:28,226][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:15:28,554][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:15:28,877][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:15:29,201][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:15:29,525][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:15:29,852][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:15:30,175][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:15:30,499][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:15:30,822][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:15:31,149][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:15:31,869][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:15:32,596][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:15:32,597][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:15:32,598][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:15:34,483][__main__][INFO] - Iteration 491 took 24s (40.23% Gen, 52.20% Train). Generation: 10s, Training: 13s. Estimated remaining time: 20h 37m 17s. Estimated total time: 20h 46m 2s. Time estimates for 10 more iterations: 4m 9s, 100 more iterations: 41m 32s, 500 more iterations: 3h 27m 40s.
[2025-11-13 11:15:34,486][__main__][INFO] - Starting iteration 491.
[2025-11-13 11:15:34,488][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 49 and human policies 1.
[2025-11-13 11:15:34,489][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:15:43,660][__main__][INFO] - Number of regex retries in iteration 491: 0
[2025-11-13 11:15:43,661][__main__][INFO] - agents played in iteration 491 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:15:44,115][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:15:44,148][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:15:44,182][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:15:44,215][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:15:44,215][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:15:44,216][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:15:44,862][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:15:45,158][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:15:45,485][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:15:45,808][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:15:46,131][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:15:46,454][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:15:46,778][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:15:47,101][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:15:47,422][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:15:47,745][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:15:48,070][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:15:48,392][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:15:48,715][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:15:49,039][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:15:49,362][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:15:49,686][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:15:50,011][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:15:50,335][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:15:50,659][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:15:50,983][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:15:51,312][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:15:51,637][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:15:51,962][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:15:52,291][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:15:52,615][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:15:52,938][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:15:53,263][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:15:53,595][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:15:53,919][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:15:54,241][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:15:54,566][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:15:54,896][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:15:55,220][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:15:55,950][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:15:56,626][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:15:56,628][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:15:56,629][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:15:57,543][__main__][INFO] - Iteration 492 took 23s (39.78% Gen, 56.25% Train). Generation: 9s, Training: 12s. Estimated remaining time: 19h 3m 40s. Estimated total time: 19h 12m 48s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 25s, 500 more iterations: 3h 12m 8s.
[2025-11-13 11:15:57,545][__main__][INFO] - Starting iteration 492.
[2025-11-13 11:15:57,548][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 49 and human policies 1.
[2025-11-13 11:15:57,549][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:16:07,375][__main__][INFO] - Number of regex retries in iteration 492: 0
[2025-11-13 11:16:07,376][__main__][INFO] - agents played in iteration 492 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:16:07,810][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:16:07,843][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:16:07,877][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:16:07,910][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:16:07,910][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:16:07,911][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:16:08,578][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:16:08,871][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:16:09,194][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:16:09,517][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:16:09,839][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:16:10,160][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:16:10,484][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:16:10,808][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:16:11,132][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:16:11,457][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:16:11,780][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:16:12,105][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:16:12,428][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:16:12,753][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:16:13,076][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:16:13,401][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:16:13,725][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:16:14,050][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:16:14,376][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:16:14,703][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:16:15,027][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:16:15,351][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:16:15,675][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:16:16,003][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:16:16,327][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:16:16,651][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:16:16,975][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:16:17,302][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:16:17,627][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:16:17,953][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:16:18,277][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:16:18,604][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:16:18,928][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:16:19,638][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:16:20,312][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:16:20,313][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:16:20,315][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:16:21,219][__main__][INFO] - Iteration 493 took 23s (41.51% Gen, 54.66% Train). Generation: 9s, Training: 12s. Estimated remaining time: 19h 34m 5s. Estimated total time: 19h 43m 36s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 27s, 500 more iterations: 3h 17m 16s.
[2025-11-13 11:16:21,222][__main__][INFO] - Starting iteration 493.
[2025-11-13 11:16:21,224][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 49 and human policies 1.
[2025-11-13 11:16:21,225][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:16:31,044][__main__][INFO] - Number of regex retries in iteration 493: 0 [2025-11-13 11:16:31,044][__main__][INFO] - agents played in iteration 493 are Alice, Bob, Alice_buffer, Bob_buffer [2025-11-13 11:16:31,482][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:16:31,516][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:16:31,550][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:16:31,583][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:16:31,583][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:16:31,584][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:16:32,247][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:16:32,542][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:16:32,864][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:16:33,196][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:16:33,522][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:16:33,845][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:16:34,174][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:16:34,503][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:16:34,825][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:16:35,149][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:16:35,471][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:16:35,800][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:16:36,123][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:16:36,448][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:16:36,773][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:16:37,103][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:16:37,427][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:16:37,751][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:16:38,074][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:16:38,403][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:16:38,726][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 11:16:39,050][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:16:39,375][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:16:39,705][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:16:40,030][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:16:40,354][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:16:40,679][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:16:41,007][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:16:41,334][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:16:41,660][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:16:41,984][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:16:42,309][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:16:42,634][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:16:43,345][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:16:44,046][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:16:44,048][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:16:44,049][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:16:44,853][__main__][INFO] - Iteration 494 took 23s (41.55% Gen, 55.04% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 31m 33s. Estimated total time: 19h 41m 28s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 22s, 500 more iterations: 3h 16m 54s.
[2025-11-13 11:16:44,855][__main__][INFO] - Starting iteration 494.
[2025-11-13 11:16:44,857][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 49 and human policies 1.
[2025-11-13 11:16:44,858][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:16:55,030][__main__][INFO] - Number of regex retries in iteration 494: 0
[2025-11-13 11:16:55,031][__main__][INFO] - agents played in iteration 494 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:16:55,466][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:16:55,499][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:16:55,531][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:16:55,564][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:16:55,564][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:16:55,565][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:16:56,243][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:16:56,539][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:16:56,864][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:16:57,187][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:16:57,512][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:16:57,836][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:16:58,160][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:16:58,486][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:16:58,812][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:16:59,136][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:16:59,459][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:16:59,783][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:17:00,107][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:17:00,432][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:17:00,755][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:17:01,079][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:17:01,403][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:17:01,728][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:17:02,052][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:17:02,376][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:17:02,700][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:17:03,024][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:17:03,348][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:17:03,677][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:17:04,004][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:17:04,329][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:17:04,652][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:17:04,978][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:17:05,307][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:17:05,630][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:17:05,956][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:17:06,280][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:17:06,605][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:17:07,316][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:17:08,016][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:17:08,017][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:17:08,019][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:17:09,084][__main__][INFO] - Iteration 495 took 24s (41.99% Gen, 53.61% Train). Generation: 10s, Training: 12s. Estimated remaining time: 20h 1m 2s. Estimated total time: 20h 11m 21s. Time estimates for 10 more iterations: 4m 2s, 100 more iterations: 40m 22s, 500 more iterations: 3h 21m 53s.
[2025-11-13 11:17:09,085][__main__][INFO] - Starting iteration 495.
[2025-11-13 11:17:09,088][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 49 and human policies 1.
[2025-11-13 11:17:09,089][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:17:18,293][__main__][INFO] - Number of regex retries in iteration 495: 0
[2025-11-13 11:17:18,293][__main__][INFO] - agents played in iteration 495 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:17:18,721][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:17:18,756][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:17:18,789][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:17:18,823][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:17:18,824][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:17:18,824][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:17:19,523][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:17:19,819][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:17:20,143][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:17:20,469][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:17:20,799][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:17:21,124][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:17:21,450][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:17:21,776][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:17:22,102][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:17:22,425][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:17:22,750][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:17:23,075][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:17:23,399][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:17:23,723][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:17:24,047][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:17:24,373][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:17:24,698][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:17:25,024][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:17:25,352][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:17:25,676][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:17:26,000][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:17:26,324][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:17:26,647][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:17:26,970][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:17:27,294][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:17:27,617][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:17:27,941][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:17:28,264][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:17:28,587][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:17:28,913][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:17:29,239][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:17:29,564][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:17:29,888][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:17:30,605][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:17:31,273][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:17:31,274][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:17:31,276][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:17:32,157][__main__][INFO] - Iteration 496 took 23s (39.90% Gen, 56.28% Train). Generation: 9s, Training: 12s. Estimated remaining time: 19h 2m 46s. Estimated total time: 19h 13m 28s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 26s, 500 more iterations: 3h 12m 14s.
[2025-11-13 11:17:32,159][__main__][INFO] - Starting iteration 496.
[2025-11-13 11:17:32,162][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 49 and human policies 1.
[2025-11-13 11:17:32,163][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:17:42,050][__main__][INFO] - Number of regex retries in iteration 496: 0
[2025-11-13 11:17:42,051][__main__][INFO] - agents played in iteration 496 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:17:42,484][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:17:42,517][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:17:42,549][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:17:42,582][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:17:42,583][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:17:42,583][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:17:43,288][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:17:43,583][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:17:43,907][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:17:44,230][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:17:44,553][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:17:44,877][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:17:45,202][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:17:45,525][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:17:45,849][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:17:46,175][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:17:46,499][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:17:46,823][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:17:47,147][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:17:47,473][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:17:47,798][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:17:48,122][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:17:48,447][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:17:48,772][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:17:49,095][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:17:49,420][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:17:49,744][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:17:50,071][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:17:50,398][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:17:50,723][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:17:51,048][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:17:51,373][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:17:51,699][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:17:52,028][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:17:52,355][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:17:52,679][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:17:53,005][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:17:53,329][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:17:53,653][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:17:54,369][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:17:55,069][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:17:55,071][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:17:55,072][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:17:55,933][__main__][INFO] - Iteration 497 took 23s (41.60% Gen, 54.78% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 37m 28s. Estimated total time: 19h 48m 34s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 37s, 500 more iterations: 3h 18m 5s.
[2025-11-13 11:17:55,934][__main__][INFO] - Starting iteration 497.
[2025-11-13 11:17:55,937][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 49 and human policies 1.
[2025-11-13 11:17:55,938][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:18:05,211][__main__][INFO] - Number of regex retries in iteration 497: 0
[2025-11-13 11:18:05,212][__main__][INFO] - agents played in iteration 497 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:18:05,673][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:18:05,707][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:18:05,740][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:18:05,773][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:18:05,774][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:18:05,774][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:18:06,473][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:18:06,768][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:18:07,092][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:18:07,415][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:18:07,740][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:18:08,063][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:18:08,387][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:18:08,710][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:18:09,035][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:18:09,359][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:18:09,683][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:18:10,006][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:18:10,329][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:18:10,654][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:18:10,976][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:18:11,301][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:18:11,626][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:18:11,949][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:18:12,272][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:18:12,600][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:18:12,925][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:18:13,251][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:18:13,575][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:18:13,901][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:18:14,224][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:18:14,547][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:18:14,872][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:18:15,196][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:18:15,521][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:18:15,849][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:18:16,167][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:18:16,491][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:18:16,816][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:18:17,528][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:18:18,207][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:18:18,208][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:18:18,209][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:18:19,019][__main__][INFO] - Iteration 498 took 23s (40.18% Gen, 56.31% Train). Generation: 9s, Training: 12s. Estimated remaining time: 19h 2m 39s. Estimated total time: 19h 14m 8s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 28s, 500 more iterations: 3h 12m 21s. [2025-11-13 11:18:19,022][__main__][INFO] - Starting iteration 498. [2025-11-13 11:18:19,024][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 49 and human policies 1. 
[2025-11-13 11:18:19,025][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:18:29,341][__main__][INFO] - Number of regex retries in iteration 498: 0 [2025-11-13 11:18:29,342][__main__][INFO] - agents played in iteration 498 are Alice, Bob, Alice_buffer, Bob_buffer [2025-11-13 11:18:29,780][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:18:29,814][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:18:29,847][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:18:29,880][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:18:29,881][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:18:29,882][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:18:30,595][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:18:30,891][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:18:31,215][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:18:31,538][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:18:31,861][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:18:32,185][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:18:32,509][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:18:32,833][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:18:33,156][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:18:33,479][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:18:33,803][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:18:34,133][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:18:34,451][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:18:34,774][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:18:35,098][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:18:35,426][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:18:35,745][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:18:36,068][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:18:36,393][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:18:36,719][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:18:37,043][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:18:37,372][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:18:37,697][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:18:38,027][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:18:38,348][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:18:38,671][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:18:38,994][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:18:39,318][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:18:39,641][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:18:39,963][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:18:40,289][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:18:40,612][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:18:40,937][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
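The run above walks 128 mini-batches and then reports a single accumulated figure over 3840 tokens. A minimal sketch of how per-mini-batch loss sums can be combined into one token-weighted average before the optimizer step (hypothetical helper, not the repo's actual code; the log only shows the final token count):

```python
def accumulate_loss(batch_sums):
    """Combine per-mini-batch (loss_sum, n_tokens) pairs into one
    global per-token average, as a single large batch would produce.
    Token weighting matters when mini-batches have unequal lengths:
    averaging the per-batch means instead would bias toward short batches."""
    total_loss = sum(loss for loss, _ in batch_sums)
    total_tokens = sum(tokens for _, tokens in batch_sums)
    return total_loss / total_tokens, total_tokens
```

With 128 mini-batches of 30 tokens each this would report 3840 tokens, matching the log line above.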
[2025-11-13 11:18:41,650][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:18:42,355][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:18:42,356][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:18:42,358][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:18:43,156][__main__][INFO] - Iteration 499 took 24s (42.75% Gen, 53.94% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 54m 44s. Estimated total time: 20h 6m 38s. Time estimates for 10 more iterations: 4m 1s, 100 more iterations: 40m 13s, 500 more iterations: 3h 21m 6s.
[2025-11-13 11:18:43,158][__main__][INFO] - Starting iteration 499.
[2025-11-13 11:18:43,161][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 49 and human policies 1.
[2025-11-13 11:18:43,161][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:18:52,994][__main__][INFO] - Number of regex retries in iteration 499: 0
[2025-11-13 11:18:52,995][__main__][INFO] - agents played in iteration 499 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:18:53,452][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:18:53,486][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:18:53,519][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:18:53,553][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:18:53,553][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:18:53,553][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:18:54,269][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:18:54,564][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:18:54,889][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:18:55,212][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:18:55,539][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:18:55,863][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:18:56,186][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:18:56,510][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:18:56,833][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:18:57,156][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:18:57,479][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:18:57,802][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:18:58,132][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:18:58,456][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:18:58,781][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:18:59,105][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:18:59,429][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:18:59,753][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:19:00,077][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:19:00,401][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:19:00,731][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:19:01,055][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:19:01,378][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:19:01,702][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:19:02,028][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:19:02,352][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:19:02,675][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:19:02,999][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:19:03,333][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:19:03,655][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:19:03,979][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:19:04,303][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:19:04,628][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:19:05,344][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:19:06,039][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:19:06,040][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:19:06,042][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:19:06,911][__main__][INFO] - Iteration 500 took 23s (41.40% Gen, 54.93% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 35m 18s. Estimated total time: 19h 47m 35s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 35s, 500 more iterations: 3h 17m 55s.
[2025-11-13 11:19:06,913][__main__][INFO] - Starting iteration 500.
[2025-11-13 11:19:06,916][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 49 and human policies 1.
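The "Estimated remaining time" / "Time estimates for N more iterations" figures in these iteration summaries are the kind of ETA that falls out of a running average of per-iteration wall-clock times. A small sketch of that arithmetic (function names hypothetical; the repo's actual estimator, e.g. whether it uses a windowed or global average, is not visible in the log):

```python
def eta_seconds(iter_times, iters_remaining):
    """Project remaining wall-clock time from observed per-iteration times."""
    avg = sum(iter_times) / len(iter_times)
    return avg * iters_remaining

def fmt_hms(seconds):
    """Render seconds in the log's 'Xh Ym Zs' style."""
    h, rem = divmod(int(seconds), 3600)
    m, s = divmod(rem, 60)
    return f"{h}h {m}m {s}s"
```

For example, at roughly 24 s per iteration, 500 more iterations projects to about 3h 20m, in the same range as the "500 more iterations: 3h 17m 55s" estimate above.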
[2025-11-13 11:19:06,917][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:19:15,830][__main__][INFO] - Number of regex retries in iteration 500: 0
[2025-11-13 11:19:15,831][__main__][INFO] - agents played in iteration 500 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:19:16,290][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:19:16,324][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:19:16,357][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:19:16,391][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:19:16,391][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:19:16,392][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:19:17,084][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:19:17,381][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:19:17,706][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:19:18,030][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:19:18,354][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:19:18,678][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:19:19,002][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:19:19,325][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:19:19,650][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:19:19,973][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:19:20,299][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:19:20,624][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:19:20,948][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:19:21,271][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:19:21,601][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:19:21,925][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:19:22,254][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:19:22,580][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:19:22,914][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:19:23,244][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:19:23,572][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:19:23,897][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:19:24,232][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:19:24,556][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:19:24,881][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:19:25,205][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:19:25,537][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:19:25,862][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:19:26,187][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:19:26,512][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:19:26,841][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:19:27,165][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:19:27,488][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:19:28,194][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:19:28,878][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:19:28,880][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:19:28,881][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:19:30,559][__main__][INFO] - Iteration 501 took 23s (37.70% Gen, 55.19% Train). Generation: 8s, Training: 13s. Estimated remaining time: 19h 29m 29s. Estimated total time: 19h 42m 10s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 24s, 500 more iterations: 3h 17m 1s.
[2025-11-13 11:19:30,561][__main__][INFO] - Starting iteration 501.
[2025-11-13 11:19:30,564][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 50 and human policies 1.
[2025-11-13 11:19:30,564][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:19:39,898][__main__][INFO] - Number of regex retries in iteration 501: 0
[2025-11-13 11:19:39,898][__main__][INFO] - agents played in iteration 501 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:19:40,333][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:19:40,366][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:19:40,400][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:19:40,433][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:19:40,434][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:19:40,434][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:19:41,114][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:19:41,408][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:19:41,731][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:19:42,058][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:19:42,386][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:19:42,723][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:19:43,049][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:19:43,373][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:19:43,696][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:19:44,029][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:19:44,353][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:19:44,677][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:19:45,002][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:19:45,336][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:19:45,662][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:19:45,987][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:19:46,313][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:19:46,641][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:19:46,967][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:19:47,291][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:19:47,616][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:19:47,946][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:19:48,273][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:19:48,596][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:19:48,920][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:19:49,246][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:19:49,570][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:19:49,894][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:19:50,218][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:19:50,543][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:19:50,867][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:19:51,194][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:19:51,519][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:19:52,225][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:19:52,920][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:19:52,922][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:19:52,924][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:19:53,766][__main__][INFO] - Iteration 502 took 23s (40.23% Gen, 56.14% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 7m 3s. Estimated total time: 19h 20m 7s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 40s, 500 more iterations: 3h 13m 21s.
[2025-11-13 11:19:53,768][__main__][INFO] - Starting iteration 502.
[2025-11-13 11:19:53,770][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 50 and human policies 1.
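The recurring "For task: ..., ΔVRAM % (total): ..., Current % of VRAM taken: ..., Block Peak % of device VRAM: ..." records report memory as percentages of total device VRAM measured around a block of work. A sketch of the formatting arithmetic behind such a line (the function and its arguments are hypothetical; the real trainer presumably reads allocator counters, e.g. CUDA allocated/peak bytes, which the log does not show):

```python
def vram_report(task, before, after, peak, device_total):
    """Format a VRAM line from byte (or GiB) counts measured around a task:
    `before`/`after` are allocations at block entry/exit, `peak` is the
    high-water mark inside the block, `device_total` is device capacity."""
    delta_pct = 100.0 * (after - before) / device_total
    current_pct = 100.0 * after / device_total
    peak_pct = 100.0 * peak / device_total
    return (f"For task: {task}, \u0394VRAM % (total): {delta_pct:.2f}%, "
            f"Current % of VRAM taken: {current_pct:.2f}%, "
            f"Block Peak % of device VRAM: {peak_pct:.2f}%")
```

Note that "Block Peak" can sit below "Current" (19.52% vs. 40.57% above) because the peak is reset per block while "Current" includes everything still resident, such as model weights.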
[2025-11-13 11:19:53,771][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:20:03,129][__main__][INFO] - Number of regex retries in iteration 502: 0
[2025-11-13 11:20:03,129][__main__][INFO] - agents played in iteration 502 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:20:03,557][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:20:03,591][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:20:03,624][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:20:03,657][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:20:03,657][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:20:03,657][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:20:04,336][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:20:04,629][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:20:04,961][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:20:05,280][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:20:05,604][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:20:05,930][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:20:06,257][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:20:06,578][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:20:06,903][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:20:07,229][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:20:07,563][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:20:07,887][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:20:08,212][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:20:08,537][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:20:08,866][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:20:09,185][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:20:09,513][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:20:09,839][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:20:10,162][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:20:10,487][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:20:10,812][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:20:11,137][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:20:11,461][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:20:11,785][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:20:12,108][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:20:12,431][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:20:12,754][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:20:13,077][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:20:13,405][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:20:13,730][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:20:14,053][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:20:14,381][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:20:14,708][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:20:15,425][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:20:16,120][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:20:16,122][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:20:16,123][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:20:16,962][__main__][INFO] - Iteration 503 took 23s (40.35% Gen, 56.02% Train). Generation: 9s, Training: 12s. Estimated remaining time: 19h 6m 12s. Estimated total time: 19h 19m 39s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 39s, 500 more iterations: 3h 13m 16s.
[2025-11-13 11:20:16,964][__main__][INFO] - Starting iteration 503.
[2025-11-13 11:20:16,967][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 50 and human policies 1.
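Each iteration ends by overwriting the optimizer and trainer state files at fixed paths, which is what makes the run resumable. One hazard with overwrite-in-place checkpoints is a crash mid-write corrupting the only copy; a common mitigation is to write to a temporary file and atomically rename it. A stdlib-only sketch of that pattern (pickle stands in for the `.pt`/`.pkl` serialization here; whether the repo itself does an atomic rename is not visible in the log):

```python
import os
import pickle
import tempfile

def save_state(obj, path):
    """Checkpoint `obj` to `path` safely: serialize to a temp file in the
    same directory, then atomically replace the destination, so a crash
    mid-write never leaves a truncated checkpoint behind."""
    directory = os.path.dirname(path) or "."
    os.makedirs(directory, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(obj, f)
        os.replace(tmp_path, path)  # atomic on POSIX within one filesystem
    except BaseException:
        os.unlink(tmp_path)
        raise
```

Writing the temp file into the destination directory (rather than /tmp) matters: `os.replace` is only atomic when source and destination are on the same filesystem.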
[2025-11-13 11:20:16,968][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:20:26,436][__main__][INFO] - Number of regex retries in iteration 503: 0
[2025-11-13 11:20:26,436][__main__][INFO] - agents played in iteration 503 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:20:26,877][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:20:26,911][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:20:26,945][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:20:26,978][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:20:26,979][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:20:26,979][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:20:27,671][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:20:27,965][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:20:28,288][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:20:28,613][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:20:28,936][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:20:29,258][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:20:29,581][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:20:29,904][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:20:30,236][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:20:30,559][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:20:30,883][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:20:31,207][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:20:31,545][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:20:31,871][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:20:32,197][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:20:32,522][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:20:32,853][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:20:33,181][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:20:33,506][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:20:33,829][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:20:34,155][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:20:34,479][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:20:34,802][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:20:35,125][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:20:35,454][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:20:35,778][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:20:36,102][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:20:36,426][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:20:36,750][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:20:37,075][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:20:37,400][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:20:37,724][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:20:38,053][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:20:38,751][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:20:39,452][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:20:39,454][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:20:39,457][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:20:40,316][__main__][INFO] - Iteration 504 took 23s (40.55% Gen, 55.76% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 13m 38s. Estimated total time: 19h 27m 28s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 54s, 500 more iterations: 3h 14m 34s.
[2025-11-13 11:20:40,318][__main__][INFO] - Starting iteration 504.
[2025-11-13 11:20:40,320][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 50 and human policies 1.
[2025-11-13 11:20:40,321][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:20:50,169][__main__][INFO] - Number of regex retries in iteration 504: 0 [2025-11-13 11:20:50,169][__main__][INFO] - agents played in iteration 504 are Alice, Bob, Alice_buffer, Bob_buffer [2025-11-13 11:20:50,622][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:20:50,664][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:20:50,699][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:20:50,733][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:20:50,733][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:20:50,734][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:20:51,413][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:20:51,708][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:20:52,033][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:20:52,364][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:20:52,681][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:20:53,007][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:20:53,333][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:20:53,665][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:20:53,984][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:20:54,310][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:20:54,640][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:20:54,968][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:20:55,297][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:20:55,620][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:20:55,950][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:20:56,281][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:20:56,603][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:20:56,927][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:20:57,250][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:20:57,574][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:20:57,898][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:20:58,222][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:20:58,547][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:20:58,876][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:20:59,195][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:20:59,518][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:20:59,842][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:21:00,173][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:21:00,497][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:21:00,822][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:21:01,146][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:21:01,471][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:21:01,797][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:21:02,482][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:21:03,171][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:21:03,172][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:21:03,173][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:21:03,978][__main__][INFO] - Iteration 505 took 23s (41.62% Gen, 54.97% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 28m 42s. Estimated total time: 19h 42m 56s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 25s, 500 more iterations: 3h 17m 9s.
[2025-11-13 11:21:03,980][__main__][INFO] - Starting iteration 505.
[2025-11-13 11:21:03,983][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 50 and human policies 1.
[2025-11-13 11:21:03,984][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:21:13,349][__main__][INFO] - Number of regex retries in iteration 505: 0
[2025-11-13 11:21:13,350][__main__][INFO] - agents played in iteration 505 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:21:13,796][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:21:13,833][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:21:13,869][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:21:13,901][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:21:13,902][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:21:13,902][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:21:14,580][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:21:14,874][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:21:15,199][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:21:15,521][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:21:15,847][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:21:16,176][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:21:16,501][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:21:16,825][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:21:17,151][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:21:17,475][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:21:17,798][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:21:18,121][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:21:18,443][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:21:18,767][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:21:19,092][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:21:19,415][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:21:19,740][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:21:20,064][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:21:20,389][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:21:20,714][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:21:21,038][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:21:21,363][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:21:21,688][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:21:22,011][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:21:22,335][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:21:22,658][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:21:22,982][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:21:23,307][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:21:23,632][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:21:23,956][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:21:24,281][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:21:24,607][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:21:24,931][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:21:25,628][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:21:26,307][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:21:26,310][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:21:26,312][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:21:27,192][__main__][INFO] - Iteration 506 took 23s (40.35% Gen, 55.85% Train). Generation: 9s, Training: 12s. Estimated remaining time: 19h 5m 52s. Estimated total time: 19h 20m 30s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 41s, 500 more iterations: 3h 13m 25s.
[2025-11-13 11:21:27,194][__main__][INFO] - Starting iteration 506.
[2025-11-13 11:21:27,197][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 50 and human policies 1.
[2025-11-13 11:21:27,198][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:21:36,678][__main__][INFO] - Number of regex retries in iteration 506: 0
[2025-11-13 11:21:36,679][__main__][INFO] - agents played in iteration 506 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:21:37,106][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:21:37,140][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:21:37,174][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:21:37,207][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:21:37,208][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:21:37,208][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:21:37,895][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:21:38,189][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:21:38,513][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:21:38,838][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:21:39,164][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:21:39,492][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:21:39,815][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:21:40,140][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:21:40,466][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:21:40,790][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:21:41,114][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:21:41,445][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:21:41,770][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:21:42,096][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:21:42,419][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:21:42,747][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:21:43,074][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:21:43,397][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:21:43,721][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:21:44,048][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:21:44,374][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:21:44,698][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:21:45,021][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:21:45,348][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:21:45,672][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:21:45,997][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:21:46,321][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:21:46,643][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:21:46,966][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:21:47,292][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:21:47,614][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:21:47,938][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:21:48,260][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:21:48,941][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:21:49,633][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:21:49,634][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:21:49,636][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:21:50,461][__main__][INFO] - Iteration 507 took 23s (40.75% Gen, 55.69% Train). Generation: 9s, Training: 12s. Estimated remaining time: 19h 8m 14s. Estimated total time: 19h 23m 15s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 46s, 500 more iterations: 3h 13m 52s.
[2025-11-13 11:21:50,463][__main__][INFO] - Starting iteration 507.
[2025-11-13 11:21:50,466][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 50 and human policies 1.
[2025-11-13 11:21:50,466][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:22:00,483][__main__][INFO] - Number of regex retries in iteration 507: 0
[2025-11-13 11:22:00,483][__main__][INFO] - agents played in iteration 507 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:22:00,909][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:22:00,946][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:22:00,993][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:22:01,026][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:22:01,026][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:22:01,027][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:22:01,696][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:22:01,991][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:22:02,316][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:22:02,639][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:22:02,965][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:22:03,295][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:22:03,622][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:22:03,946][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:22:04,270][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:22:04,598][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:22:04,927][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:22:05,253][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:22:05,579][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:22:05,902][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:22:06,226][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:22:06,551][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:22:06,873][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:22:07,196][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:22:07,520][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:22:07,850][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:22:08,179][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:22:08,504][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:22:08,827][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:22:09,153][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:22:09,477][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:22:09,800][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:22:10,123][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:22:10,446][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:22:10,770][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:22:11,093][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:22:11,419][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:22:11,749][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:22:12,077][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:22:12,787][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:22:13,456][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:22:13,457][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:22:13,459][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:22:14,276][__main__][INFO] - Iteration 508 took 23s (42.06% Gen, 54.50% Train). Generation: 10s, Training: 12s. Estimated remaining time: 19h 35m 9s. Estimated total time: 19h 50m 33s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 41s, 500 more iterations: 3h 18m 25s.
[2025-11-13 11:22:14,278][__main__][INFO] - Starting iteration 508.
[2025-11-13 11:22:14,282][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 50 and human policies 1.
[2025-11-13 11:22:14,282][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:22:24,532][__main__][INFO] - Number of regex retries in iteration 508: 0
[2025-11-13 11:22:24,533][__main__][INFO] - agents played in iteration 508 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:22:24,949][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:22:24,985][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:22:25,018][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:22:25,051][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:22:25,051][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:22:25,051][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:22:25,755][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:22:26,063][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:22:26,387][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:22:26,710][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:22:27,037][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:22:27,367][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:22:27,692][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:22:28,021][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:22:28,345][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:22:28,669][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:22:28,994][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:22:29,319][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:22:29,644][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:22:29,974][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:22:30,298][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:22:30,621][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:22:30,945][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:22:31,275][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:22:31,604][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:22:31,926][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:22:32,250][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:22:32,574][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:22:32,900][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:22:33,223][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:22:33,548][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:22:33,873][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:22:34,197][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:22:34,520][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:22:34,843][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:22:35,169][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:22:35,494][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:22:35,818][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:22:36,142][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:22:36,864][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:22:37,544][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:22:37,545][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:22:37,547][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:22:38,644][__main__][INFO] - Iteration 509 took 24s (42.07% Gen, 53.42% Train). Generation: 10s, Training: 13s. Estimated remaining time: 20h 2m 20s. Estimated total time: 20h 18m 9s. Time estimates for 10 more iterations: 4m 3s, 100 more iterations: 40m 36s, 500 more iterations: 3h 23m 1s.
[2025-11-13 11:22:38,645][__main__][INFO] - Starting iteration 509.
[2025-11-13 11:22:38,648][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 50 and human policies 1.
[2025-11-13 11:22:38,649][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:22:47,657][__main__][INFO] - Number of regex retries in iteration 509: 0
[2025-11-13 11:22:47,658][__main__][INFO] - agents played in iteration 509 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:22:48,084][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:22:48,117][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:22:48,150][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:22:48,183][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:22:48,184][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:22:48,184][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:22:48,878][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:22:49,174][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:22:49,497][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:22:49,822][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:22:50,152][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:22:50,477][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:22:50,806][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:22:51,135][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:22:51,462][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:22:51,785][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:22:52,108][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:22:52,432][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:22:52,756][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:22:53,081][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:22:53,408][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:22:53,729][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:22:54,053][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:22:54,377][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:22:54,701][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:22:55,025][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:22:55,350][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:22:55,678][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:22:56,008][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:22:56,327][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:22:56,649][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:22:56,972][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:22:57,296][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:22:57,619][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:22:57,942][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:22:58,268][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:22:58,594][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:22:58,919][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:22:59,245][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:22:59,960][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:23:00,636][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:23:00,638][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:23:00,639][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:23:01,457][__main__][INFO] - Iteration 510 took 22s (39.50% Gen, 56.92% Train). Generation: 9s, Training: 12s. Estimated remaining time: 18h 44m 16s. Estimated total time: 19h 0m 27s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 0s, 500 more iterations: 3h 10m 4s.
[2025-11-13 11:23:01,458][__main__][INFO] - Starting iteration 510.
[2025-11-13 11:23:01,461][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 50 and human policies 1.
[2025-11-13 11:23:01,462][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:23:11,135][__main__][INFO] - Number of regex retries in iteration 510: 0
[2025-11-13 11:23:11,136][__main__][INFO] - agents played in iteration 510 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:23:11,575][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:23:11,609][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:23:11,642][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:23:11,675][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:23:11,676][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:23:11,676][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:23:12,389][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:23:12,685][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:23:13,008][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:23:13,332][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:23:13,656][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:23:13,979][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:23:14,306][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:23:14,633][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:23:14,956][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:23:15,280][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:23:15,604][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:23:15,928][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:23:16,251][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:23:16,576][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:23:16,900][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:23:17,225][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:23:17,551][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:23:17,875][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:23:18,199][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:23:18,523][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:23:18,848][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:23:19,171][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:23:19,501][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:23:19,829][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:23:20,154][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:23:20,479][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:23:20,804][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:23:21,128][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:23:21,457][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:23:21,780][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:23:22,104][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:23:22,430][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:23:22,754][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:23:23,463][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:23:24,169][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:23:24,171][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:23:24,173][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:23:25,778][__main__][INFO] - Iteration 511 took 24s (39.78% Gen, 53.61% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 59m 18s. Estimated total time: 20h 15m 54s. Time estimates for 10 more iterations: 4m 3s, 100 more iterations: 40m 31s, 500 more iterations: 3h 22m 39s.
[2025-11-13 11:23:25,780][__main__][INFO] - Starting iteration 511.
[2025-11-13 11:23:25,783][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 51 and human policies 1.
[2025-11-13 11:23:25,784][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:23:34,573][__main__][INFO] - Number of regex retries in iteration 511: 0
[2025-11-13 11:23:34,574][__main__][INFO] - agents played in iteration 511 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:23:35,005][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:23:35,039][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:23:35,073][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:23:35,107][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:23:35,108][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:23:35,108][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:23:35,814][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:23:36,110][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:23:36,433][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:23:36,757][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:23:37,081][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:23:37,404][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:23:37,727][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:23:38,051][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:23:38,386][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:23:38,709][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:23:39,032][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:23:39,356][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:23:39,688][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:23:40,012][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:23:40,335][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:23:40,658][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:23:40,990][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:23:41,316][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:23:41,641][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:23:41,965][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:23:42,289][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:23:42,611][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:23:42,934][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:23:43,258][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:23:43,591][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:23:43,916][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:23:44,238][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:23:44,564][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:23:44,894][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:23:45,222][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:23:45,545][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:23:45,869][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:23:46,193][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:23:46,897][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:23:47,587][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:23:47,589][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:23:47,590][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:23:48,397][__main__][INFO] - Iteration 512 took 22s (38.87% Gen, 57.56% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 33m 44s. Estimated total time: 18h 50m 43s. Time estimates for 10 more iterations: 3m 46s, 100 more iterations: 37m 41s, 500 more iterations: 3h 8m 27s.
[2025-11-13 11:23:48,399][__main__][INFO] - Starting iteration 512.
[2025-11-13 11:23:48,403][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 51 and human policies 1.
[2025-11-13 11:23:48,403][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:23:57,785][__main__][INFO] - Number of regex retries in iteration 512: 0
[2025-11-13 11:23:57,786][__main__][INFO] - agents played in iteration 512 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:23:58,233][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:23:58,266][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:23:58,300][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:23:58,333][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:23:58,334][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:23:58,334][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:23:59,056][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:23:59,352][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:23:59,678][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:24:00,000][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:24:00,325][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:24:00,647][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:24:00,972][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:24:01,295][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:24:01,624][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:24:01,947][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:24:02,273][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:24:02,597][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:24:02,921][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:24:03,244][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:24:03,568][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:24:03,891][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:24:04,215][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:24:04,538][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:24:04,861][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:24:05,183][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:24:05,506][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:24:05,830][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:24:06,153][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:24:06,476][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:24:06,801][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:24:07,128][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:24:07,453][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:24:07,779][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:24:08,104][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:24:08,431][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:24:08,757][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:24:09,083][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:24:09,408][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:24:10,120][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:24:10,791][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:24:10,796][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:24:10,797][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:24:11,625][__main__][INFO] - Iteration 513 took 23s (40.40% Gen, 56.03% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 3m 48s. Estimated total time: 19h 21m 10s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 42s, 500 more iterations: 3h 13m 31s.
[2025-11-13 11:24:11,627][__main__][INFO] - Starting iteration 513.
[2025-11-13 11:24:11,630][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 51 and human policies 1.
[2025-11-13 11:24:11,630][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:24:21,584][__main__][INFO] - Number of regex retries in iteration 513: 0
[2025-11-13 11:24:21,584][__main__][INFO] - agents played in iteration 513 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:24:22,027][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:24:22,063][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:24:22,097][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:24:22,131][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:24:22,131][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:24:22,131][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:24:22,846][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:24:23,144][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:24:23,468][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:24:23,790][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:24:24,118][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:24:24,441][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:24:24,764][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:24:25,087][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:24:25,410][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:24:25,740][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:24:26,055][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:24:26,377][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:24:26,700][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:24:27,029][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:24:27,347][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:24:27,670][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:24:27,999][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:24:28,324][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:24:28,650][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:24:28,974][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:24:29,299][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:24:29,622][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:24:29,947][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:24:30,270][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:24:30,593][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:24:30,917][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:24:31,244][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:24:31,568][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:24:31,893][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:24:32,217][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:24:32,542][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:24:32,866][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:24:33,191][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:24:33,922][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:24:34,602][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:24:34,604][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:24:34,605][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:24:35,424][__main__][INFO] - Iteration 514 took 23s (41.83% Gen, 54.72% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 32m 0s. Estimated total time: 19h 49m 46s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 39s, 500 more iterations: 3h 18m 17s.
[2025-11-13 11:24:35,426][__main__][INFO] - Starting iteration 514.
[2025-11-13 11:24:35,430][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 51 and human policies 1.
[2025-11-13 11:24:35,431][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:24:45,121][__main__][INFO] - Number of regex retries in iteration 514: 0
[2025-11-13 11:24:45,122][__main__][INFO] - agents played in iteration 514 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:24:45,575][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:24:45,612][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:24:45,658][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:24:45,691][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:24:45,692][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:24:45,692][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:24:46,407][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:24:46,703][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:24:47,027][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:24:47,351][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:24:47,674][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:24:47,997][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:24:48,320][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:24:48,642][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:24:48,966][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:24:49,290][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:24:49,613][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:24:49,936][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:24:50,260][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:24:50,582][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:24:50,905][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:24:51,231][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:24:51,561][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:24:51,889][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:24:52,218][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:24:52,543][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:24:52,867][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:24:53,195][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:24:53,524][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:24:53,842][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:24:54,165][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:24:54,487][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:24:54,811][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:24:55,135][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:24:55,458][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:24:55,781][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:24:56,105][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:24:56,429][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:24:56,752][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:24:57,468][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:24:58,172][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:24:58,176][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:24:58,178][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:24:59,033][__main__][INFO] - Iteration 515 took 23s (41.06% Gen, 55.31% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 22m 1s. Estimated total time: 19h 40m 11s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 20s, 500 more iterations: 3h 16m 41s.
[2025-11-13 11:24:59,035][__main__][INFO] - Starting iteration 515.
[2025-11-13 11:24:59,038][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 51 and human policies 1.
[2025-11-13 11:24:59,039][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:25:08,416][__main__][INFO] - Number of regex retries in iteration 515: 0 [2025-11-13 11:25:08,417][__main__][INFO] - agents played in iteration 515 are Alice, Bob, Alice_buffer, Bob_buffer [2025-11-13 11:25:08,849][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:25:08,884][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:25:08,919][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:25:08,953][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:25:08,953][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:25:08,954][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:25:09,694][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:25:09,988][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:25:10,312][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:25:10,635][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:25:10,960][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:25:11,284][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:25:11,608][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:25:11,932][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:25:12,256][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:25:12,580][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:25:12,904][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:25:13,227][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:25:13,551][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:25:13,876][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:25:14,200][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:25:14,525][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:25:14,852][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:25:15,181][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:25:15,507][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:25:15,832][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:25:16,159][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:25:16,483][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:25:16,806][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:25:17,129][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:25:17,454][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:25:17,778][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:25:18,103][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:25:18,428][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:25:18,751][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:25:19,076][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:25:19,400][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:25:19,723][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:25:20,050][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:25:20,740][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:25:21,440][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:25:21,442][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:25:21,444][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:25:22,350][__main__][INFO] - Iteration 516 took 23s (40.23% Gen, 55.88% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 7m 4s. Estimated total time: 19h 25m 37s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 51s, 500 more iterations: 3h 14m 16s.
[2025-11-13 11:25:22,352][__main__][INFO] - Starting iteration 516.
[2025-11-13 11:25:22,355][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 51 and human policies 1.
[2025-11-13 11:25:22,356][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:25:32,309][__main__][INFO] - Number of regex retries in iteration 516: 0
[2025-11-13 11:25:32,309][__main__][INFO] - agents played in iteration 516 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:25:32,754][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:25:32,787][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:25:32,821][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:25:32,857][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:25:32,857][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:25:32,858][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:25:33,574][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:25:33,870][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:25:34,194][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:25:34,522][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:25:34,845][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:25:35,168][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:25:35,491][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:25:35,815][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:25:36,141][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:25:36,468][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:25:36,791][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:25:37,116][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:25:37,445][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:25:37,769][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:25:38,094][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:25:38,421][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:25:38,746][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:25:39,070][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:25:39,394][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:25:39,718][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:25:40,044][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:25:40,368][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:25:40,693][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:25:41,017][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:25:41,343][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:25:41,667][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:25:41,995][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:25:42,321][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:25:42,646][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:25:42,970][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:25:43,292][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:25:43,617][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:25:43,945][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:25:44,644][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:25:45,361][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:25:45,362][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:25:45,364][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:25:46,219][__main__][INFO] - Iteration 517 took 23s (41.71% Gen, 54.70% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 34m 18s. Estimated total time: 19h 53m 14s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 46s, 500 more iterations: 3h 18m 52s.
[2025-11-13 11:25:46,221][__main__][INFO] - Starting iteration 517.
[2025-11-13 11:25:46,229][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 51 and human policies 1.
[2025-11-13 11:25:46,230][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:25:56,029][__main__][INFO] - Number of regex retries in iteration 517: 0
[2025-11-13 11:25:56,030][__main__][INFO] - agents played in iteration 517 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:25:56,484][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:25:56,517][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:25:56,551][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:25:56,584][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:25:56,585][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:25:56,585][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:25:57,311][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:25:57,604][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:25:57,928][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:25:58,256][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:25:58,577][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:25:58,902][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:25:59,230][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:25:59,554][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:25:59,880][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:26:00,208][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:26:00,535][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:26:00,864][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:26:01,189][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:26:01,514][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:26:01,837][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:26:02,159][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:26:02,482][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:26:02,806][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:26:03,131][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:26:03,455][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:26:03,779][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:26:04,102][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:26:04,425][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:26:04,748][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:26:05,072][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:26:05,396][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:26:05,725][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:26:06,051][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:26:06,373][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:26:06,696][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:26:07,020][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:26:07,343][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:26:07,666][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:26:08,358][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:26:09,069][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:26:09,071][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:26:09,072][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:26:09,943][__main__][INFO] - Iteration 518 took 23s (41.32% Gen, 54.99% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 26m 30s. Estimated total time: 19h 45m 50s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 31s, 500 more iterations: 3h 17m 38s.
[2025-11-13 11:26:09,945][__main__][INFO] - Starting iteration 518.
[2025-11-13 11:26:09,947][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 51 and human policies 1.
[2025-11-13 11:26:09,948][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:26:18,682][__main__][INFO] - Number of regex retries in iteration 518: 0
[2025-11-13 11:26:18,683][__main__][INFO] - agents played in iteration 518 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:26:19,115][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:26:19,149][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:26:19,184][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:26:19,218][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:26:19,218][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:26:19,219][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:26:19,945][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:26:20,240][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:26:20,565][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:26:20,889][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:26:21,214][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:26:21,538][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:26:21,867][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:26:22,194][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:26:22,522][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:26:22,847][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:26:23,171][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:26:23,495][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:26:23,816][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:26:24,139][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:26:24,463][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:26:24,787][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:26:25,111][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:26:25,435][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:26:25,759][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:26:26,084][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:26:26,408][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:26:26,733][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:26:27,057][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:26:27,379][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:26:27,702][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:26:28,025][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:26:28,348][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:26:28,675][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:26:29,000][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:26:29,324][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:26:29,649][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:26:29,973][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:26:30,296][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:26:30,989][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:26:31,715][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:26:31,716][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:26:31,718][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:26:32,592][__main__][INFO] - Iteration 519 took 22s (38.57% Gen, 57.56% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 32m 34s. Estimated total time: 18h 52m 16s. Time estimates for 10 more iterations: 3m 46s, 100 more iterations: 37m 44s, 500 more iterations: 3h 8m 42s.
[2025-11-13 11:26:32,595][__main__][INFO] - Starting iteration 519.
[2025-11-13 11:26:32,598][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 51 and human policies 1.
[2025-11-13 11:26:32,599][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:26:41,931][__main__][INFO] - Number of regex retries in iteration 519: 0
[2025-11-13 11:26:41,932][__main__][INFO] - agents played in iteration 519 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:26:42,386][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:26:42,420][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:26:42,453][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:26:42,487][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:26:42,487][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:26:42,488][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:26:43,222][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:26:43,530][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:26:43,858][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:26:44,181][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:26:44,504][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:26:44,835][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:26:45,158][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:26:45,486][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:26:45,812][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:26:46,141][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:26:46,466][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:26:46,790][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:26:47,112][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:26:47,441][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:26:47,765][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:26:48,088][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:26:48,414][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:26:48,741][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:26:49,066][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:26:49,390][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:26:49,714][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:26:50,039][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:26:50,362][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:26:50,687][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:26:51,013][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:26:51,336][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:26:51,660][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:26:51,984][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:26:52,306][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:26:52,629][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:26:52,955][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:26:53,281][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:26:53,603][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:26:54,280][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:26:54,983][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:26:54,984][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:26:54,986][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:26:55,882][__main__][INFO] - Iteration 520 took 23s (40.09% Gen, 56.06% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 4m 7s. Estimated total time: 19h 24m 13s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 48s, 500 more iterations: 3h 14m 2s.
[2025-11-13 11:26:55,883][__main__][INFO] - Starting iteration 520.
[2025-11-13 11:26:55,887][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 51 and human policies 1.
[2025-11-13 11:26:55,887][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:27:05,227][__main__][INFO] - Number of regex retries in iteration 520: 0
[2025-11-13 11:27:05,227][__main__][INFO] - agents played in iteration 520 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:27:05,696][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:27:05,729][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:27:05,764][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:27:05,797][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:27:05,798][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:27:05,799][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:27:06,525][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:27:06,821][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:27:07,146][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:27:07,470][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:27:07,799][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:27:08,125][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:27:08,451][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:27:08,778][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:27:09,103][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:27:09,428][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:27:09,754][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:27:10,077][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:27:10,402][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:27:10,725][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:27:11,050][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:27:11,374][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:27:11,698][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:27:12,025][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:27:12,348][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:27:12,671][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:27:12,994][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:27:13,319][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:27:13,645][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:27:13,969][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:27:14,295][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:27:14,618][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:27:14,946][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:27:15,270][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:27:15,594][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:27:15,921][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:27:16,248][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:27:16,571][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:27:16,894][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:27:17,630][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:27:18,349][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:27:18,350][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:27:18,352][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:27:20,057][__main__][INFO] - Iteration 521 took 24s (38.64% Gen, 54.30% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 48m 3s. Estimated total time: 20h 8m 33s. Time estimates for 10 more iterations: 4m 1s, 100 more iterations: 40m 17s, 500 more iterations: 3h 21m 25s.
[2025-11-13 11:27:20,059][__main__][INFO] - Starting iteration 521.
[2025-11-13 11:27:20,062][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 52 and human policies 1.
[2025-11-13 11:27:20,062][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:27:29,498][__main__][INFO] - Number of regex retries in iteration 521: 0
[2025-11-13 11:27:29,498][__main__][INFO] - agents played in iteration 521 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:27:29,932][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:27:29,965][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:27:29,998][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:27:30,032][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:27:30,033][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:27:30,033][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:27:30,752][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:27:31,046][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:27:31,374][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:27:31,699][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:27:32,031][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:27:32,357][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:27:32,681][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:27:33,006][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:27:33,340][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:27:33,664][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:27:33,988][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:27:34,312][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:27:34,643][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:27:34,966][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:27:35,291][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:27:35,614][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:27:35,943][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:27:36,266][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:27:36,591][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:27:36,920][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:27:37,247][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:27:37,571][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:27:37,895][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:27:38,218][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:27:38,543][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:27:38,867][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:27:39,191][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:27:39,515][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:27:39,839][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:27:40,163][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:27:40,490][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:27:40,814][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:27:41,140][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:27:41,848][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:27:42,584][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:27:42,588][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:27:42,591][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:27:43,439][__main__][INFO] - Iteration 522 took 23s (40.36% Gen, 56.00% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 8m 2s. Estimated total time: 19h 28m 56s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 57s, 500 more iterations: 3h 14m 49s.
[2025-11-13 11:27:43,441][__main__][INFO] - Starting iteration 522.
[2025-11-13 11:27:43,444][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 52 and human policies 1.
[2025-11-13 11:27:43,445][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:27:53,033][__main__][INFO] - Number of regex retries in iteration 522: 0
[2025-11-13 11:27:53,033][__main__][INFO] - agents played in iteration 522 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:27:53,472][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:27:53,506][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:27:53,541][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:27:53,574][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:27:53,575][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:27:53,575][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:27:54,286][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:27:54,581][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:27:54,904][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:27:55,228][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:27:55,554][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:27:55,877][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:27:56,200][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:27:56,523][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:27:56,850][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:27:57,173][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:27:57,499][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:27:57,822][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:27:58,150][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:27:58,474][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:27:58,798][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:27:59,121][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:27:59,450][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:27:59,774][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:28:00,099][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:28:00,423][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:28:00,751][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:28:01,077][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:28:01,402][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:28:01,728][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:28:02,053][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:28:02,376][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:28:02,701][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:28:03,026][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:28:03,357][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:28:03,684][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:28:04,014][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:28:04,338][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:28:04,664][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:28:05,370][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:28:06,085][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:28:06,086][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:28:06,088][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:28:06,889][__main__][INFO] - Iteration 523 took 23s (40.89% Gen, 55.68% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 11m 0s. Estimated total time: 19h 32m 17s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 4s, 500 more iterations: 3h 15m 22s.
[2025-11-13 11:28:06,891][__main__][INFO] - Starting iteration 523.
[2025-11-13 11:28:06,893][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 52 and human policies 1.
[2025-11-13 11:28:06,894][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:28:16,578][__main__][INFO] - Number of regex retries in iteration 523: 0
[2025-11-13 11:28:16,578][__main__][INFO] - agents played in iteration 523 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:28:17,017][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:28:17,051][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:28:17,085][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:28:17,119][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:28:17,119][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:28:17,120][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:28:17,846][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:28:18,141][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:28:18,468][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:28:18,791][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:28:19,119][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:28:19,444][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:28:19,768][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:28:20,096][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:28:20,427][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:28:20,756][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:28:21,081][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:28:21,405][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:28:21,739][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:28:22,062][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:28:22,385][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:28:22,710][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:28:23,047][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:28:23,372][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:28:23,696][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:28:24,019][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:28:24,353][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:28:24,676][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:28:25,000][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:28:25,323][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:28:25,654][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:28:25,976][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:28:26,302][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:28:26,626][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:28:26,956][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:28:27,280][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:28:27,605][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:28:27,929][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:28:28,258][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:28:28,962][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:28:29,679][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:28:29,680][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:28:29,681][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:28:30,512][__main__][INFO] - Iteration 524 took 23s (41.00% Gen, 55.48% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 19m 17s. Estimated total time: 19h 40m 58s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 21s, 500 more iterations: 3h 16m 49s.
[2025-11-13 11:28:30,514][__main__][INFO] - Starting iteration 524.
[2025-11-13 11:28:30,517][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 52 and human policies 1.
[2025-11-13 11:28:30,518][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:28:38,824][__main__][INFO] - Number of regex retries in iteration 524: 0
[2025-11-13 11:28:38,825][__main__][INFO] - agents played in iteration 524 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:28:39,267][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:28:39,301][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:28:39,335][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:28:39,369][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:28:39,370][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:28:39,370][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:28:40,085][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:28:40,381][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:28:40,708][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:28:41,034][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:28:41,365][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:28:41,691][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:28:42,015][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:28:42,339][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:28:42,666][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:28:42,989][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:28:43,313][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:28:43,637][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:28:43,965][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:28:44,291][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:28:44,615][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:28:44,939][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:28:45,266][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:28:45,592][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:28:45,916][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:28:46,241][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:28:46,567][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:28:46,892][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:28:47,219][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:28:47,543][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:28:47,867][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:28:48,191][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:28:48,514][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:28:48,839][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:28:49,163][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:28:49,487][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:28:49,817][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:28:50,140][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:28:50,465][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:28:51,160][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:28:51,874][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:28:51,876][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:28:51,878][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:28:52,892][__main__][INFO] - Iteration 525 took 22s (37.12% Gen, 58.34% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 16m 42s. Estimated total time: 18h 38m 46s. Time estimates for 10 more iterations: 3m 43s, 100 more iterations: 37m 17s, 500 more iterations: 3h 6m 27s.
[2025-11-13 11:28:52,894][__main__][INFO] - Starting iteration 525.
[2025-11-13 11:28:52,897][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 52 and human policies 1.
[2025-11-13 11:28:52,898][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:29:02,653][__main__][INFO] - Number of regex retries in iteration 525: 0
[2025-11-13 11:29:02,653][__main__][INFO] - agents played in iteration 525 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:29:03,087][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:29:03,122][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:29:03,156][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:29:03,189][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:29:03,190][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:29:03,191][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:29:03,910][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:29:04,205][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:29:04,529][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:29:04,853][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:29:05,178][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:29:05,502][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:29:05,827][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:29:06,151][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:29:06,476][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:29:06,800][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:29:07,125][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:29:07,449][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:29:07,773][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:29:08,097][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:29:08,421][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:29:08,745][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:29:09,067][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:29:09,390][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:29:09,714][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:29:10,039][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:29:10,362][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:29:10,686][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:29:11,009][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:29:11,333][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:29:11,655][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:29:11,979][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:29:12,303][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:29:12,628][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:29:12,965][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:29:13,290][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:29:13,613][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:29:13,936][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:29:14,268][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:29:14,961][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:29:15,658][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:29:15,661][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:29:15,663][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:29:16,605][__main__][INFO] - Iteration 526 took 23s (41.14% Gen, 54.87% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 22m 57s. Estimated total time: 19h 45m 24s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 30s, 500 more iterations: 3h 17m 34s.
[2025-11-13 11:29:16,607][__main__][INFO] - Starting iteration 526.
[2025-11-13 11:29:16,609][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 52 and human policies 1.
[2025-11-13 11:29:16,610][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:29:26,471][__main__][INFO] - Number of regex retries in iteration 526: 0
[2025-11-13 11:29:26,471][__main__][INFO] - agents played in iteration 526 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:29:26,910][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:29:26,943][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:29:26,976][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:29:27,010][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:29:27,011][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:29:27,011][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:29:27,720][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:29:28,014][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:29:28,340][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:29:28,668][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:29:28,992][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:29:29,318][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:29:29,643][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:29:29,969][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:29:30,295][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:29:30,620][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:29:30,945][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:29:31,270][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:29:31,596][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:29:31,920][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:29:32,243][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:29:32,570][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:29:32,894][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:29:33,221][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:29:33,546][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:29:33,871][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:29:34,194][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:29:34,518][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:29:34,842][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:29:35,167][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:29:35,492][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:29:35,821][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:29:36,147][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:29:36,478][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:29:36,795][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:29:37,122][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:29:37,448][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:29:37,777][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:29:38,104][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:29:38,799][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:29:39,501][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:29:39,502][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:29:39,504][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:29:40,304][__main__][INFO] - Iteration 527 took 23s (41.62% Gen, 55.00% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 21m 57s. Estimated total time: 19h 44m 47s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 29s, 500 more iterations: 3h 17m 27s.
[2025-11-13 11:29:40,306][__main__][INFO] - Starting iteration 527.
[2025-11-13 11:29:40,309][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 52 and human policies 1.
[2025-11-13 11:29:40,310][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:29:50,473][__main__][INFO] - Number of regex retries in iteration 527: 0
[2025-11-13 11:29:50,474][__main__][INFO] - agents played in iteration 527 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:29:50,921][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:29:50,955][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:29:50,989][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:29:51,023][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:29:51,024][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:29:51,024][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:29:51,760][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:29:52,057][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:29:52,382][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:29:52,706][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:29:53,030][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:29:53,354][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:29:53,679][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:29:54,001][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:29:54,328][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:29:54,650][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:29:54,975][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:29:55,298][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:29:55,624][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:29:55,946][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:29:56,272][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:29:56,595][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:29:56,918][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:29:57,247][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:29:57,574][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:29:57,903][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:29:58,229][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:29:58,555][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:29:58,884][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:29:59,209][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:29:59,533][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:29:59,859][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:30:00,184][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:30:00,507][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:30:00,835][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:30:01,161][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:30:01,487][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:30:01,810][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:30:02,135][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:30:02,839][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:30:03,519][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:30:03,520][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:30:03,522][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:30:04,322][__main__][INFO] - Iteration 528 took 24s (42.32% Gen, 54.33% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 37m 26s. Estimated total time: 20h 0m 40s. Time estimates for 10 more iterations: 4m 0s, 100 more iterations: 40m 1s, 500 more iterations: 3h 20m 6s.
[2025-11-13 11:30:04,324][__main__][INFO] - Starting iteration 528.
[2025-11-13 11:30:04,326][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 52 and human policies 1.
[2025-11-13 11:30:04,327][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:30:14,040][__main__][INFO] - Number of regex retries in iteration 528: 0
[2025-11-13 11:30:14,041][__main__][INFO] - agents played in iteration 528 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:30:14,478][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:30:14,511][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:30:14,545][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:30:14,582][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:30:14,583][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:30:14,583][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:30:15,300][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:30:15,595][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:30:15,923][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:30:16,245][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:30:16,569][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:30:16,892][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:30:17,216][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:30:17,539][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:30:17,863][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:30:18,186][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:30:18,512][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:30:18,834][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:30:19,158][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:30:19,483][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:30:19,805][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:30:20,135][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:30:20,458][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:30:20,781][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:30:21,112][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:30:21,438][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:30:21,762][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:30:22,085][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:30:22,411][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:30:22,733][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:30:23,059][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:30:23,387][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:30:23,711][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:30:24,037][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:30:24,368][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:30:24,698][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:30:25,023][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:30:25,348][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:30:25,675][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:30:26,509][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:30:27,223][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:30:27,225][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:30:27,227][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:30:28,057][__main__][INFO] - Iteration 529 took 23s (40.93% Gen, 55.56% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 22m 56s. Estimated total time: 19h 46m 34s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 33s, 500 more iterations: 3h 17m 45s.
[2025-11-13 11:30:28,059][__main__][INFO] - Starting iteration 529.
[2025-11-13 11:30:28,062][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 52 and human policies 1.
[2025-11-13 11:30:28,063][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:30:37,783][__main__][INFO] - Number of regex retries in iteration 529: 0
[2025-11-13 11:30:37,783][__main__][INFO] - agents played in iteration 529 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:30:38,228][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:30:38,266][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:30:38,316][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:30:38,350][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:30:38,351][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:30:38,351][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:30:39,074][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:30:39,369][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:30:39,699][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:30:40,015][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:30:40,339][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:30:40,664][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:30:40,994][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:30:41,312][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:30:41,635][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:30:41,958][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:30:42,288][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:30:42,606][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:30:42,930][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:30:43,254][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:30:43,589][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:30:43,906][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:30:44,228][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:30:44,551][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:30:44,880][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:30:45,205][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:30:45,533][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:30:45,860][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:30:46,189][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:30:46,509][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:30:46,832][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:30:47,156][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:30:47,480][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:30:47,805][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:30:48,130][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:30:48,453][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:30:48,777][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:30:49,103][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:30:49,427][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:30:50,147][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:30:50,834][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:30:50,835][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:30:50,837][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:30:51,669][__main__][INFO] - Iteration 530 took 23s (41.18% Gen, 55.29% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 16m 20s. Estimated total time: 19h 40m 22s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 20s, 500 more iterations: 3h 16m 43s.
[2025-11-13 11:30:51,671][__main__][INFO] - Starting iteration 530.
[2025-11-13 11:30:51,673][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 52 and human policies 1.
[2025-11-13 11:30:51,674][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:30:57,036][mllm.models.large_language_model_local][WARNING] - Response A did not match regex: (|), retry 1/1
[2025-11-13 11:31:00,700][__main__][INFO] - Number of regex retries in iteration 530: 1
[2025-11-13 11:31:00,701][__main__][INFO] - agents played in iteration 530 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:31:01,155][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:31:01,188][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:31:01,223][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:31:01,257][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:31:01,258][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:31:01,258][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:31:01,979][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:31:02,275][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:31:02,602][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:31:02,926][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:31:03,249][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:31:03,579][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:31:03,905][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:31:04,229][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:31:04,558][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:31:04,883][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:31:05,209][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:31:05,539][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:31:05,875][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:31:06,204][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:31:06,532][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:31:06,861][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:31:07,192][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:31:07,518][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:31:07,845][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:31:08,172][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:31:08,497][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:31:08,821][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:31:09,145][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:31:09,471][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:31:09,794][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:31:10,118][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:31:10,445][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:31:10,770][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:31:11,094][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:31:11,418][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:31:11,742][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:31:12,066][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:31:12,392][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:31:13,087][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:31:13,806][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:31:13,809][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:31:13,810][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:31:15,480][__main__][INFO] - Iteration 531 took 23s (37.92% Gen, 55.06% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 25m 58s. Estimated total time: 19h 50m 23s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 40s, 500 more iterations: 3h 18m 23s.
[2025-11-13 11:31:15,483][__main__][INFO] - Starting iteration 531.
[2025-11-13 11:31:15,486][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 53 and human policies 1.
[2025-11-13 11:31:15,486][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:31:25,150][__main__][INFO] - Number of regex retries in iteration 531: 0
[2025-11-13 11:31:25,150][__main__][INFO] - agents played in iteration 531 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:31:25,615][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:31:25,649][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:31:25,682][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:31:25,715][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:31:25,716][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:31:25,716][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:31:26,435][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:31:26,734][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:31:27,054][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:31:27,379][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:31:27,705][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:31:28,034][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:31:28,357][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:31:28,681][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:31:29,006][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:31:29,334][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:31:29,657][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:31:29,983][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:31:30,309][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:31:30,637][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:31:30,966][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:31:31,294][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:31:31,621][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:31:31,948][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:31:32,280][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:31:32,606][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:31:32,932][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 11:31:33,256][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:31:33,585][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:31:33,909][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:31:34,233][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:31:34,556][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:31:34,883][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:31:35,206][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:31:35,531][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:31:35,854][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:31:36,183][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:31:36,506][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:31:36,832][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
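The accumulation pattern recorded above (128 mini-batches, progress logged every 4th, then one accumulated total of 3840 tokens, i.e. 30 tokens per mini-batch) can be sketched in a minimal, self-contained form. This is a hypothetical reconstruction of the logging schedule, not code from the mllm codebase; the function name `run_accumulation` and the per-mini-batch token count are assumptions chosen to be consistent with the logged totals.

```python
def run_accumulation(num_minibatches=128, log_every=4, tokens_per_minibatch=30):
    """Walk the mini-batch schedule seen in the log: note which indices
    would emit a 'Processing mini-batch i of N' line, and accumulate a
    token count that is only reported once at the end."""
    logged_indices = []
    total_tokens = 0
    for i in range(num_minibatches):
        if i % log_every == 0:
            logged_indices.append(i)  # mirrors "Processing mini-batch i of 128"
        total_tokens += tokens_per_minibatch  # loss is accumulated, not applied, here
    # a single optimizer step would follow after the loop ("Apply reinforce step")
    return logged_indices, total_tokens
```

With the defaults this yields 32 progress lines (indices 0, 4, ..., 124) and a 3840-token total, matching the "Accumulated the policy gradient loss for 3840 tokens" record.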
[2025-11-13 11:31:37,554][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:31:38,257][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:31:38,258][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:31:38,260][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:31:39,167][__main__][INFO] - Iteration 532 took 23s (40.81% Gen, 55.36% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 19m 16s. Estimated total time: 19h 44m 5s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 28s, 500 more iterations: 3h 17m 20s. [2025-11-13 11:31:39,169][__main__][INFO] - Starting iteration 532. [2025-11-13 11:31:39,172][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 53 and human policies 1. 
[2025-11-13 11:31:39,173][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:31:48,813][__main__][INFO] - Number of regex retries in iteration 532: 0 [2025-11-13 11:31:48,813][__main__][INFO] - agents played in iteration 532 are Alice, Bob, Alice_buffer, Bob_buffer [2025-11-13 11:31:49,260][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:31:49,293][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:31:49,327][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:31:49,360][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:31:49,361][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:31:49,361][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:31:50,068][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:31:50,363][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:31:50,687][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:31:51,016][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:31:51,345][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:31:51,670][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:31:51,995][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:31:52,319][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:31:52,644][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:31:52,969][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:31:53,295][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:31:53,619][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:31:53,944][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:31:54,271][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:31:54,594][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:31:54,917][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:31:55,241][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:31:55,566][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:31:55,892][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:31:56,216][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:31:56,541][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 11:31:56,866][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:31:57,190][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:31:57,515][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:31:57,842][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:31:58,166][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:31:58,492][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:31:58,816][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:31:59,139][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:31:59,463][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:31:59,787][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:32:00,111][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:32:00,435][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:32:01,138][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:32:01,865][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:32:01,866][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:32:01,868][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:32:02,874][__main__][INFO] - Iteration 533 took 23s (40.67% Gen, 55.08% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 19m 55s. Estimated total time: 19h 45m 8s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 30s, 500 more iterations: 3h 17m 31s. [2025-11-13 11:32:02,876][__main__][INFO] - Starting iteration 533. [2025-11-13 11:32:02,880][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 53 and human policies 1. 
[2025-11-13 11:32:02,880][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:32:12,962][__main__][INFO] - Number of regex retries in iteration 533: 0 [2025-11-13 11:32:12,963][__main__][INFO] - agents played in iteration 533 are Alice, Bob, Alice_buffer, Bob_buffer [2025-11-13 11:32:13,409][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:32:13,443][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:32:13,478][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:32:13,512][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:32:13,513][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:32:13,513][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:32:14,213][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:32:14,506][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:32:14,830][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:32:15,154][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:32:15,480][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:32:15,809][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:32:16,137][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:32:16,462][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:32:16,786][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:32:17,111][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:32:17,435][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:32:17,759][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:32:18,083][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:32:18,408][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:32:18,732][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:32:19,057][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:32:19,381][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:32:19,712][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:32:20,032][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:32:20,356][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:32:20,680][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 11:32:21,011][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:32:21,328][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:32:21,651][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:32:21,975][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:32:22,299][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:32:22,624][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:32:22,949][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:32:23,271][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:32:23,603][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:32:23,921][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:32:24,245][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:32:24,569][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:32:25,267][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:32:25,985][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:32:25,986][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:32:25,988][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:32:26,907][__main__][INFO] - Iteration 534 took 24s (41.96% Gen, 54.21% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 35m 48s. Estimated total time: 20h 1m 25s. Time estimates for 10 more iterations: 4m 0s, 100 more iterations: 40m 2s, 500 more iterations: 3h 20m 14s. [2025-11-13 11:32:26,910][__main__][INFO] - Starting iteration 534. [2025-11-13 11:32:26,913][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 53 and human policies 1. 
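The VRAM accounting in the "Apply reinforce step" records (ΔVRAM % of total, current % taken, block peak %) reduces to simple percentage arithmetic over byte counts. In a CUDA run these byte counts would typically come from `torch.cuda.memory_allocated` and `torch.cuda.max_memory_allocated`; the helper below is a minimal sketch of only the arithmetic, with a hypothetical name (`vram_report`) not taken from the codebase.

```python
def vram_report(before_bytes, after_bytes, peak_bytes, device_total_bytes):
    """Express a block's memory footprint as percentages of device VRAM:
    delta across the block, current allocation, and block peak."""
    delta_pct = 100.0 * (after_bytes - before_bytes) / device_total_bytes
    current_pct = 100.0 * after_bytes / device_total_bytes
    peak_pct = 100.0 * peak_bytes / device_total_bytes
    return delta_pct, current_pct, peak_pct
```

For example, a block that grows allocation from 100 to 200 units with a 400-unit peak on a 1000-unit device reports (10.0, 20.0, 40.0).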
[2025-11-13 11:32:26,913][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:32:37,087][__main__][INFO] - Number of regex retries in iteration 534: 0 [2025-11-13 11:32:37,088][__main__][INFO] - agents played in iteration 534 are Alice, Bob, Alice_buffer, Bob_buffer [2025-11-13 11:32:37,526][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:32:37,563][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:32:37,596][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:32:37,630][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:32:37,631][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:32:37,631][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:32:38,335][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:32:38,629][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:32:38,952][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:32:39,275][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:32:39,599][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:32:39,925][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:32:40,256][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:32:40,580][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:32:40,909][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:32:41,235][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:32:41,559][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:32:41,883][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:32:42,214][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:32:42,531][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:32:42,857][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:32:43,181][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:32:43,506][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:32:43,830][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:32:44,155][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:32:44,479][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:32:44,804][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 11:32:45,129][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:32:45,452][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:32:45,781][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:32:46,107][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:32:46,431][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:32:46,758][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:32:47,081][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:32:47,404][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:32:47,728][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:32:48,052][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:32:48,374][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:32:48,702][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:32:49,399][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:32:50,139][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:32:50,141][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:32:50,142][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:32:51,030][__main__][INFO] - Iteration 535 took 24s (42.19% Gen, 54.13% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 39m 53s. Estimated total time: 20h 5m 55s. Time estimates for 10 more iterations: 4m 1s, 100 more iterations: 40m 11s, 500 more iterations: 3h 20m 59s. [2025-11-13 11:32:51,032][__main__][INFO] - Starting iteration 535. [2025-11-13 11:32:51,035][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 53 and human policies 1. 
[2025-11-13 11:32:51,036][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:33:01,274][__main__][INFO] - Number of regex retries in iteration 535: 0 [2025-11-13 11:33:01,275][__main__][INFO] - agents played in iteration 535 are Alice, Bob, Alice_buffer, Bob_buffer [2025-11-13 11:33:01,702][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:33:01,735][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:33:01,767][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:33:01,800][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:33:01,801][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:33:01,801][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:33:02,507][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:33:02,802][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:33:03,127][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:33:03,452][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:33:03,782][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:33:04,108][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:33:04,439][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:33:04,758][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:33:05,081][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:33:05,405][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:33:05,730][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:33:06,054][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:33:06,378][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:33:06,701][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:33:07,025][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:33:07,350][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:33:07,674][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:33:07,999][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:33:08,331][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:33:08,647][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:33:08,972][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 11:33:09,298][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:33:09,626][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:33:09,950][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:33:10,277][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:33:10,606][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:33:10,933][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:33:11,253][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:33:11,583][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:33:11,908][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:33:12,234][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:33:12,557][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:33:12,883][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:33:13,616][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:33:14,340][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:33:14,341][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:33:14,343][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:33:15,168][__main__][INFO] - Iteration 536 took 24s (42.43% Gen, 54.15% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 40m 17s. Estimated total time: 20h 6m 42s. Time estimates for 10 more iterations: 4m 1s, 100 more iterations: 40m 13s, 500 more iterations: 3h 21m 7s. [2025-11-13 11:33:15,171][__main__][INFO] - Starting iteration 536. [2025-11-13 11:33:15,173][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 53 and human policies 1. 
[2025-11-13 11:33:15,174][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:33:24,793][__main__][INFO] - Number of regex retries in iteration 536: 0 [2025-11-13 11:33:24,794][__main__][INFO] - agents played in iteration 536 are Alice, Bob, Alice_buffer, Bob_buffer [2025-11-13 11:33:25,227][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:33:25,261][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:33:25,296][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:33:25,347][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:33:25,348][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:33:25,348][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:33:26,086][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:33:26,383][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:33:26,711][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:33:27,040][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:33:27,365][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:33:27,689][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:33:28,019][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:33:28,341][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:33:28,666][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:33:28,989][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:33:29,313][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:33:29,637][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:33:29,961][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:33:30,285][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:33:30,610][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:33:30,934][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:33:31,258][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:33:31,584][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:33:31,907][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:33:32,232][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:33:32,560][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 11:33:32,883][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:33:33,207][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:33:33,531][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:33:33,858][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:33:34,181][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:33:34,507][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:33:34,834][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:33:35,158][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:33:35,483][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:33:35,810][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:33:36,136][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:33:36,465][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:33:37,164][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:33:37,891][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:33:37,893][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:33:37,894][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:33:38,789][__main__][INFO] - Iteration 537 took 23s (40.73% Gen, 55.47% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 13m 59s. Estimated total time: 19h 40m 48s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 21s, 500 more iterations: 3h 16m 48s. [2025-11-13 11:33:38,791][__main__][INFO] - Starting iteration 537. [2025-11-13 11:33:38,794][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 53 and human policies 1. 
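The per-iteration timing summaries ("10 more iterations: 3m 56s, 100 more iterations: 39m 21s, ...") are consistent with a straightforward projection from an average iteration duration of roughly 23.6 s. A minimal sketch of that projection follows; the function name `project_time` is hypothetical, and the exact averaging window the trainer uses is not visible in the log.

```python
def project_time(avg_iter_seconds, n_iters):
    """Project wall-clock time for n_iters more iterations from an
    average per-iteration duration, as (hours, minutes, seconds)."""
    total_seconds = round(avg_iter_seconds * n_iters)
    minutes, seconds = divmod(total_seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return hours, minutes, seconds
```

With an assumed 23.61 s average, this reproduces the logged-style estimates: 10 iterations project to 3m 56s and 100 iterations to 39m 21s.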
[2025-11-13 11:33:38,794][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:33:48,708][__main__][INFO] - Number of regex retries in iteration 537: 0
[2025-11-13 11:33:48,708][__main__][INFO] - agents played in iteration 537 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:33:49,160][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:33:49,194][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:33:49,227][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:33:49,262][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:33:49,262][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:33:49,263][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:33:49,987][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:33:50,283][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:33:50,618][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:33:50,942][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:33:51,269][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:33:51,592][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:33:51,917][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:33:52,242][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:33:52,566][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:33:52,891][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:33:53,215][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:33:53,539][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:33:53,862][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:33:54,186][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:33:54,512][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:33:54,836][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:33:55,161][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:33:55,487][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:33:55,813][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:33:56,137][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:33:56,462][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:33:56,786][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:33:57,111][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:33:57,438][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:33:57,768][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:33:58,091][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:33:58,422][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:33:58,752][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:33:59,078][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:33:59,404][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:33:59,728][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:34:00,058][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:34:00,385][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:34:01,099][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:34:01,833][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:34:01,835][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:34:01,837][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:34:02,757][__main__][INFO] - Iteration 538 took 23s (41.37% Gen, 54.79% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 30m 58s. Estimated total time: 19h 58m 11s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 56s, 500 more iterations: 3h 19m 41s.
[2025-11-13 11:34:02,759][__main__][INFO] - Starting iteration 538.
[2025-11-13 11:34:02,763][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 53 and human policies 1.
[2025-11-13 11:34:02,763][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:34:13,361][__main__][INFO] - Number of regex retries in iteration 538: 0
[2025-11-13 11:34:13,361][__main__][INFO] - agents played in iteration 538 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:34:13,818][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:34:13,851][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:34:13,885][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:34:13,932][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:34:13,933][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:34:13,933][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:34:14,666][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:34:14,961][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:34:15,285][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:34:15,609][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:34:15,934][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:34:16,257][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:34:16,582][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:34:16,906][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:34:17,230][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:34:17,556][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:34:17,880][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:34:18,206][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:34:18,530][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:34:18,854][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:34:19,178][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:34:19,502][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:34:19,825][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:34:20,149][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:34:20,472][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:34:20,796][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:34:21,120][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:34:21,449][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:34:21,775][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:34:22,110][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:34:22,434][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:34:22,759][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:34:23,082][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:34:23,407][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:34:23,729][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:34:24,053][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:34:24,378][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:34:24,703][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:34:25,028][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:34:25,746][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:34:26,507][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:34:26,509][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:34:26,511][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:34:27,424][__main__][INFO] - Iteration 539 took 24s (42.97% Gen, 53.32% Train). Generation: 10s, Training: 13s. Estimated remaining time: 20h 5m 28s. Estimated total time: 20h 33m 6s. Time estimates for 10 more iterations: 4m 6s, 100 more iterations: 41m 6s, 500 more iterations: 3h 25m 31s.
[2025-11-13 11:34:27,426][__main__][INFO] - Starting iteration 539.
[2025-11-13 11:34:27,429][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 53 and human policies 1.
[2025-11-13 11:34:27,429][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:34:37,871][__main__][INFO] - Number of regex retries in iteration 539: 0
[2025-11-13 11:34:37,871][__main__][INFO] - agents played in iteration 539 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:34:38,320][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:34:38,355][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:34:38,390][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:34:38,424][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:34:38,425][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:34:38,426][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:34:39,157][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:34:39,452][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:34:39,779][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:34:40,104][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:34:40,432][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:34:40,756][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:34:41,084][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:34:41,412][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:34:41,738][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:34:42,062][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:34:42,386][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:34:42,709][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:34:43,035][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:34:43,362][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:34:43,688][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:34:44,014][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:34:44,341][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:34:44,668][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:34:44,993][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:34:45,316][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:34:45,640][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:34:45,964][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:34:46,288][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:34:46,613][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:34:46,940][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:34:47,262][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:34:47,586][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:34:47,910][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:34:48,236][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:34:48,562][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:34:48,888][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:34:49,212][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:34:49,537][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:34:50,247][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:34:50,981][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:34:50,983][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:34:50,984][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:34:51,785][__main__][INFO] - Iteration 540 took 24s (42.87% Gen, 53.84% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 49m 49s. Estimated total time: 20h 17m 51s. Time estimates for 10 more iterations: 4m 3s, 100 more iterations: 40m 35s, 500 more iterations: 3h 22m 58s.
[2025-11-13 11:34:51,787][__main__][INFO] - Starting iteration 540.
[2025-11-13 11:34:51,790][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 53 and human policies 1.
[2025-11-13 11:34:51,790][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:35:01,427][__main__][INFO] - Number of regex retries in iteration 540: 0
[2025-11-13 11:35:01,427][__main__][INFO] - agents played in iteration 540 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:35:01,878][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:35:01,912][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:35:01,945][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:35:01,978][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:35:01,979][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:35:01,979][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:35:02,706][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:35:03,010][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:35:03,334][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:35:03,658][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:35:03,981][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:35:04,305][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:35:04,628][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:35:04,951][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:35:05,276][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:35:05,597][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:35:05,922][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:35:06,245][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:35:06,570][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:35:06,894][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:35:07,217][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:35:07,540][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:35:07,867][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:35:08,190][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:35:08,517][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:35:08,843][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:35:09,167][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:35:09,499][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:35:09,826][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:35:10,148][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:35:10,472][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:35:10,798][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:35:11,121][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:35:11,445][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:35:11,771][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:35:12,095][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:35:12,419][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:35:12,743][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:35:13,067][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:35:13,777][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:35:14,496][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:35:14,497][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:35:14,499][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:35:16,185][__main__][INFO] - Iteration 541 took 24s (39.50% Gen, 53.58% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 51m 21s. Estimated total time: 20h 19m 47s. Time estimates for 10 more iterations: 4m 3s, 100 more iterations: 40m 39s, 500 more iterations: 3h 23m 17s.
[2025-11-13 11:35:16,187][__main__][INFO] - Starting iteration 541.
[2025-11-13 11:35:16,190][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 54 and human policies 1.
[2025-11-13 11:35:16,191][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:35:26,668][__main__][INFO] - Number of regex retries in iteration 541: 0
[2025-11-13 11:35:26,669][__main__][INFO] - agents played in iteration 541 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:35:27,119][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:35:27,156][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:35:27,189][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:35:27,222][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:35:27,223][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:35:27,223][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:35:27,922][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:35:28,216][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:35:28,540][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:35:28,863][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:35:29,188][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:35:29,510][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:35:29,834][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:35:30,159][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:35:30,484][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:35:30,809][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:35:31,135][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:35:31,457][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:35:31,783][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:35:32,107][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:35:32,432][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:35:32,756][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:35:33,079][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:35:33,404][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:35:33,731][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:35:34,054][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:35:34,380][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:35:34,706][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:35:35,034][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:35:35,361][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:35:35,689][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:35:36,016][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:35:36,343][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:35:36,671][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:35:36,996][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:35:37,325][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:35:37,652][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:35:37,978][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:35:38,302][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:35:39,029][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:35:39,734][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:35:39,735][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:35:39,737][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:35:40,543][__main__][INFO] - Iteration 542 took 24s (43.02% Gen, 53.66% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 48m 52s. Estimated total time: 20h 17m 43s. Time estimates for 10 more iterations: 4m 3s, 100 more iterations: 40m 35s, 500 more iterations: 3h 22m 57s.
[2025-11-13 11:35:40,545][__main__][INFO] - Starting iteration 542.
[2025-11-13 11:35:40,548][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 54 and human policies 1.
[2025-11-13 11:35:40,549][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:35:50,911][__main__][INFO] - Number of regex retries in iteration 542: 0
[2025-11-13 11:35:50,912][__main__][INFO] - agents played in iteration 542 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:35:51,356][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:35:51,393][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:35:51,427][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:35:51,477][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:35:51,477][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:35:51,477][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:35:52,194][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:35:52,490][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:35:52,815][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:35:53,138][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:35:53,462][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:35:53,787][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:35:54,109][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:35:54,432][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:35:54,756][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:35:55,079][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:35:55,407][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:35:55,734][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:35:56,059][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:35:56,385][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:35:56,710][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:35:57,036][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:35:57,366][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:35:57,693][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:35:58,023][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:35:58,347][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:35:58,671][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:35:58,994][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:35:59,317][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:35:59,643][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:35:59,966][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:36:00,291][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:36:00,614][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:36:00,942][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:36:01,271][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:36:01,599][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:36:01,924][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:36:02,247][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:36:02,573][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:36:03,318][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:36:04,052][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:36:04,053][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:36:04,055][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:36:04,913][__main__][INFO] - Iteration 543 took 24s (42.53% Gen, 53.94% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 49m 3s. Estimated total time: 20h 18m 18s. Time estimates for 10 more iterations: 4m 3s, 100 more iterations: 40m 36s, 500 more iterations: 3h 23m 3s.
[2025-11-13 11:36:04,916][__main__][INFO] - Starting iteration 543.
[2025-11-13 11:36:04,919][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 54 and human policies 1.
[2025-11-13 11:36:04,920][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:36:16,112][__main__][INFO] - Number of regex retries in iteration 543: 0
[2025-11-13 11:36:16,112][__main__][INFO] - agents played in iteration 543 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:36:16,543][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:36:16,577][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:36:16,612][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:36:16,646][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:36:16,647][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:36:16,647][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:36:17,373][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:36:17,676][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:36:17,997][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:36:18,327][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:36:18,652][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:36:18,983][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:36:19,306][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:36:19,632][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:36:19,959][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:36:20,287][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:36:20,608][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:36:20,934][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:36:21,260][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:36:21,589][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:36:21,911][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:36:22,241][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:36:22,571][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:36:22,897][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:36:23,227][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:36:23,550][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:36:23,876][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:36:24,206][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:36:24,537][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:36:24,861][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:36:25,188][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:36:25,513][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:36:25,843][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:36:26,169][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:36:26,492][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:36:26,816][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:36:27,148][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:36:27,472][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:36:27,799][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:36:28,532][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:36:29,244][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:36:29,245][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:36:29,247][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:36:30,073][__main__][INFO] - Iteration 544 took 25s (44.49% Gen, 52.22% Train). Generation: 11s, Training: 13s. Estimated remaining time: 20h 28m 3s. Estimated total time: 20h 57m 43s. Time estimates for 10 more iterations: 4m 11s, 100 more iterations: 41m 55s, 500 more iterations: 3h 29m 37s.
[2025-11-13 11:36:30,075][__main__][INFO] - Starting iteration 544.
[2025-11-13 11:36:30,078][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 54 and human policies 1.
[2025-11-13 11:36:30,079][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:36:41,485][__main__][INFO] - Number of regex retries in iteration 544: 0
[2025-11-13 11:36:41,486][__main__][INFO] - agents played in iteration 544 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:36:41,931][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:36:41,964][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:36:41,998][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:36:42,031][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:36:42,032][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:36:42,032][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:36:42,708][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:36:43,002][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:36:43,327][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:36:43,649][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:36:43,981][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:36:44,296][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:36:44,622][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:36:44,945][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:36:45,275][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:36:45,593][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:36:45,915][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:36:46,240][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:36:46,572][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:36:46,889][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:36:47,212][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:36:47,538][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:36:47,869][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:36:48,186][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:36:48,509][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:36:48,832][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:36:49,156][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:36:49,480][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:36:49,809][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:36:50,134][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:36:50,464][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:36:50,781][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:36:51,106][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:36:51,429][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:36:51,759][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:36:52,079][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:36:52,404][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:36:52,728][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:36:53,055][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:36:53,776][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:36:54,469][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:36:54,470][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:36:54,472][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:36:55,274][__main__][INFO] - Iteration 545 took 25s (45.27% Gen, 51.54% Train). Generation: 11s, Training: 12s. Estimated remaining time: 20h 29m 44s. Estimated total time: 20h 59m 50s. Time estimates for 10 more iterations: 4m 11s, 100 more iterations: 41m 59s, 500 more iterations: 3h 29m 58s.
[2025-11-13 11:36:55,276][__main__][INFO] - Starting iteration 545.
[2025-11-13 11:36:55,279][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 54 and human policies 1.
[2025-11-13 11:36:55,279][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:37:05,428][__main__][INFO] - Number of regex retries in iteration 545: 0
[2025-11-13 11:37:05,428][__main__][INFO] - agents played in iteration 545 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:37:05,866][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:37:05,900][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:37:05,934][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:37:05,968][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:37:05,969][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:37:05,969][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:37:06,665][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:37:06,960][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:37:07,285][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:37:07,606][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:37:07,930][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:37:08,254][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:37:08,577][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:37:08,902][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:37:09,227][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:37:09,549][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:37:09,872][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:37:10,196][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:37:10,519][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:37:10,843][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:37:11,165][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:37:11,486][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:37:11,809][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:37:12,132][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:37:12,456][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:37:12,780][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:37:13,104][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:37:13,428][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:37:13,752][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:37:14,074][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:37:14,398][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:37:14,720][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:37:15,044][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:37:15,369][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:37:15,693][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:37:16,017][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:37:16,342][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:37:16,666][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:37:16,991][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:37:17,703][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:37:18,436][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:37:18,438][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:37:18,439][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:37:19,246][__main__][INFO] - Iteration 546 took 23s (42.34% Gen, 54.29% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 27m 54s. Estimated total time: 19h 58m 23s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 56s, 500 more iterations: 3h 19m 43s.
[2025-11-13 11:37:19,248][__main__][INFO] - Starting iteration 546.
[2025-11-13 11:37:19,251][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 54 and human policies 1.
[2025-11-13 11:37:19,251][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:37:30,633][__main__][INFO] - Number of regex retries in iteration 546: 0
[2025-11-13 11:37:30,634][__main__][INFO] - agents played in iteration 546 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:37:31,067][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:37:31,102][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:37:31,135][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:37:31,169][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:37:31,170][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:37:31,170][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:37:31,856][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:37:32,151][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:37:32,475][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:37:32,798][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:37:33,122][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:37:33,452][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:37:33,780][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:37:34,105][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:37:34,429][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:37:34,754][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:37:35,077][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:37:35,400][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:37:35,724][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:37:36,047][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:37:36,378][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:37:36,695][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:37:37,019][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:37:37,343][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:37:37,679][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:37:38,002][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:37:38,326][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:37:38,649][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:37:38,979][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:37:39,296][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:37:39,620][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:37:39,944][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:37:40,277][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:37:40,596][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:37:40,923][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:37:41,248][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:37:41,579][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:37:41,898][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:37:42,229][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:37:42,965][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:37:43,648][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:37:43,649][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:37:43,651][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:37:44,447][__main__][INFO] - Iteration 547 took 25s (45.17% Gen, 51.66% Train). Generation: 11s, Training: 13s. Estimated remaining time: 20h 28m 58s. Estimated total time: 20h 59m 53s. Time estimates for 10 more iterations: 4m 11s, 100 more iterations: 41m 59s, 500 more iterations: 3h 29m 58s.
[2025-11-13 11:37:44,449][__main__][INFO] - Starting iteration 547.
[2025-11-13 11:37:44,452][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 54 and human policies 1.
[2025-11-13 11:37:44,453][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:37:55,531][__main__][INFO] - Number of regex retries in iteration 547: 0
[2025-11-13 11:37:55,531][__main__][INFO] - agents played in iteration 547 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:37:55,953][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:37:56,002][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:37:56,034][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:37:56,067][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:37:56,067][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:37:56,068][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:37:56,737][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:37:57,031][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:37:57,356][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:37:57,680][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:37:58,004][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:37:58,327][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:37:58,650][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:37:58,974][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:37:59,297][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:37:59,620][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:37:59,943][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:38:00,271][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:38:00,594][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:38:00,916][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:38:01,239][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:38:01,568][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:38:01,892][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:38:02,215][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:38:02,538][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:38:02,864][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:38:03,187][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:38:03,510][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:38:03,834][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:38:04,158][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:38:04,481][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:38:04,808][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:38:05,131][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:38:05,458][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:38:05,784][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:38:06,114][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:38:06,439][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:38:06,765][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:38:07,091][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:38:07,830][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:38:08,511][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:38:08,512][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:38:08,514][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:38:09,364][__main__][INFO] - Iteration 548 took 24s (44.47% Gen, 52.11% Train). Generation: 11s, Training: 12s. Estimated remaining time: 20h 14m 18s. Estimated total time: 20h 45m 37s. Time estimates for 10 more iterations: 4m 9s, 100 more iterations: 41m 31s, 500 more iterations: 3h 27m 36s.
[2025-11-13 11:38:09,366][__main__][INFO] - Starting iteration 548.
[2025-11-13 11:38:09,368][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 54 and human policies 1.
[2025-11-13 11:38:09,369][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:38:19,645][__main__][INFO] - Number of regex retries in iteration 548: 0 [2025-11-13 11:38:19,646][__main__][INFO] - agents played in iteration 548 are Alice, Bob, Alice_buffer, Bob_buffer [2025-11-13 11:38:20,079][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:38:20,112][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:38:20,144][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:38:20,177][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:38:20,178][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:38:20,178][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:38:20,846][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:38:21,151][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:38:21,475][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:38:21,797][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:38:22,122][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:38:22,445][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:38:22,770][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:38:23,095][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:38:23,424][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:38:23,746][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:38:24,071][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:38:24,394][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:38:24,717][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:38:25,040][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:38:25,363][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:38:25,686][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:38:26,009][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:38:26,337][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:38:26,663][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:38:26,986][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:38:27,311][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:38:27,635][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:38:27,957][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:38:28,281][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:38:28,605][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:38:28,931][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:38:29,257][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:38:29,580][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:38:29,904][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:38:30,230][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:38:30,554][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:38:30,879][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:38:31,205][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:38:31,946][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:38:32,623][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:38:32,624][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:38:32,625][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:38:33,420][__main__][INFO] - Iteration 549 took 24s (42.73% Gen, 53.96% Train). Generation: 10s, Training: 12s. Estimated remaining time: 19h 30m 54s. Estimated total time: 20h 2m 37s. Time estimates for 10 more iterations: 4m 0s, 100 more iterations: 40m 5s, 500 more iterations: 3h 20m 26s. [2025-11-13 11:38:33,422][__main__][INFO] - Starting iteration 549. [2025-11-13 11:38:33,425][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 54 and human policies 1. 
[2025-11-13 11:38:33,425][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:38:44,169][__main__][INFO] - Number of regex retries in iteration 549: 0 [2025-11-13 11:38:44,170][__main__][INFO] - agents played in iteration 549 are Alice, Bob, Alice_buffer, Bob_buffer [2025-11-13 11:38:44,613][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:38:44,647][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:38:44,680][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:38:44,713][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:38:44,713][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:38:44,714][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:38:45,381][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:38:45,684][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:38:46,001][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:38:46,324][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:38:46,648][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:38:46,978][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:38:47,296][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:38:47,619][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:38:47,943][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:38:48,269][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:38:48,594][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:38:48,919][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:38:49,245][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:38:49,569][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:38:49,894][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:38:50,218][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:38:50,542][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:38:50,871][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:38:51,196][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:38:51,520][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:38:51,846][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:38:52,170][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:38:52,494][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:38:52,817][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:38:53,142][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:38:53,466][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:38:53,793][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:38:54,116][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:38:54,443][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:38:54,766][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:38:55,091][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:38:55,418][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:38:55,742][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:38:56,463][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:38:57,151][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:38:57,153][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:38:57,155][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:38:57,990][__main__][INFO] - Iteration 550 took 24s (43.73% Gen, 52.86% Train). Generation: 10s, Training: 12s. Estimated remaining time: 19h 56m 11s. Estimated total time: 20h 28m 19s. Time estimates for 10 more iterations: 4m 5s, 100 more iterations: 40m 56s, 500 more iterations: 3h 24m 43s. [2025-11-13 11:38:57,992][__main__][INFO] - Starting iteration 550. [2025-11-13 11:38:57,995][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 54 and human policies 1. 
[2025-11-13 11:38:57,995][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:39:08,154][__main__][INFO] - Number of regex retries in iteration 550: 0 [2025-11-13 11:39:08,155][__main__][INFO] - agents played in iteration 550 are Alice, Bob, Alice_buffer, Bob_buffer [2025-11-13 11:39:08,593][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:39:08,631][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:39:08,665][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:39:08,699][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:39:08,700][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:39:08,700][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:39:09,387][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:39:09,685][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:39:10,009][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:39:10,332][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:39:10,657][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:39:10,981][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:39:11,304][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:39:11,630][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:39:11,953][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:39:12,284][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:39:12,606][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:39:12,931][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:39:13,253][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:39:13,582][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:39:13,907][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:39:14,233][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:39:14,558][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:39:14,887][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:39:15,211][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:39:15,533][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:39:15,856][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:39:16,185][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:39:16,512][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:39:16,838][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:39:17,165][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:39:17,489][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:39:17,813][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:39:18,137][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:39:18,463][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:39:18,790][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:39:19,114][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:39:19,441][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:39:19,774][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:39:20,500][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:39:21,199][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:39:21,201][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:39:21,202][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:39:22,746][__main__][INFO] - Iteration 551 took 24s (41.04% Gen, 52.71% Train). Generation: 10s, Training: 13s. Estimated remaining time: 20h 5m 4s. Estimated total time: 20h 37m 37s. Time estimates for 10 more iterations: 4m 7s, 100 more iterations: 41m 15s, 500 more iterations: 3h 26m 16s. [2025-11-13 11:39:22,748][__main__][INFO] - Starting iteration 551. [2025-11-13 11:39:22,751][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 55 and human policies 1. 
[2025-11-13 11:39:22,751][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:39:33,680][__main__][INFO] - Number of regex retries in iteration 551: 0 [2025-11-13 11:39:33,680][__main__][INFO] - agents played in iteration 551 are Alice, Bob, Alice_buffer, Bob_buffer [2025-11-13 11:39:34,107][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:39:34,140][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:39:34,173][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:39:34,205][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:39:34,206][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:39:34,206][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:39:34,898][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:39:35,193][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:39:35,517][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:39:35,844][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:39:36,170][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:39:36,495][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:39:36,820][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:39:37,146][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:39:37,472][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:39:37,801][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:39:38,124][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:39:38,449][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:39:38,771][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:39:39,096][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:39:39,422][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:39:39,746][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:39:40,071][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:39:40,393][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:39:40,719][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:39:41,045][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:39:41,373][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:39:41,695][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:39:42,019][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:39:42,343][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:39:42,671][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:39:42,994][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:39:43,317][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:39:43,641][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:39:43,964][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:39:44,291][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:39:44,615][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:39:44,939][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:39:45,265][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:39:45,984][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:39:46,697][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:39:46,699][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:39:46,701][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:39:47,525][__main__][INFO] - Iteration 552 took 24s (44.11% Gen, 52.56% Train). Generation: 10s, Training: 13s. Estimated remaining time: 20h 5m 45s. Estimated total time: 20h 38m 43s. Time estimates for 10 more iterations: 4m 7s, 100 more iterations: 41m 17s, 500 more iterations: 3h 26m 27s. [2025-11-13 11:39:47,526][__main__][INFO] - Starting iteration 552. [2025-11-13 11:39:47,530][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 55 and human policies 1. 
[2025-11-13 11:39:47,530][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:39:58,318][__main__][INFO] - Number of regex retries in iteration 552: 0 [2025-11-13 11:39:58,318][__main__][INFO] - agents played in iteration 552 are Alice, Bob, Alice_buffer, Bob_buffer [2025-11-13 11:39:58,738][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:39:58,771][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:39:58,803][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:39:58,836][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:39:58,837][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:39:58,837][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:39:59,505][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:39:59,799][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:40:00,123][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:40:00,447][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:40:00,771][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:40:01,096][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:40:01,421][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:40:01,745][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:40:02,069][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:40:02,395][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:40:02,718][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:40:03,042][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:40:03,365][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:40:03,693][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:40:04,017][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:40:04,340][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:40:04,663][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:40:04,991][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:40:05,314][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:40:05,637][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:40:05,963][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:40:06,289][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:40:06,614][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:40:06,941][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:40:07,266][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:40:07,594][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:40:07,920][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:40:08,244][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:40:08,568][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:40:08,896][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:40:09,225][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:40:09,552][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:40:09,876][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:40:10,604][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:40:11,283][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:40:11,284][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:40:11,286][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:40:12,128][__main__][INFO] - Iteration 553 took 24s (43.85% Gen, 52.72% Train). Generation: 10s, Training: 12s. Estimated remaining time: 19h 56m 34s. Estimated total time: 20h 29m 57s. Time estimates for 10 more iterations: 4m 5s, 100 more iterations: 40m 59s, 500 more iterations: 3h 24m 59s. [2025-11-13 11:40:12,130][__main__][INFO] - Starting iteration 553. [2025-11-13 11:40:12,133][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 55 and human policies 1. 
[2025-11-13 11:40:12,133][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:40:22,486][__main__][INFO] - Number of regex retries in iteration 553: 0 [2025-11-13 11:40:22,486][__main__][INFO] - agents played in iteration 553 are Alice, Bob, Alice_buffer, Bob_buffer [2025-11-13 11:40:22,927][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:40:22,960][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:40:23,006][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:40:23,040][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:40:23,040][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:40:23,041][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:40:23,712][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:40:24,006][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:40:24,333][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:40:24,660][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:40:24,985][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:40:25,310][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:40:25,634][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:40:25,957][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:40:26,280][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:40:26,603][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:40:26,932][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:40:27,263][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:40:27,591][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:40:27,914][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:40:28,238][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:40:28,561][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:40:28,887][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:40:29,211][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:40:29,536][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:40:29,869][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:40:30,192][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:40:30,516][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:40:30,840][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:40:31,168][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:40:31,492][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:40:31,815][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:40:32,139][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:40:32,476][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:40:32,807][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:40:33,132][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:40:33,456][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:40:33,792][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:40:34,118][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:40:34,837][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:40:35,524][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:40:35,525][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:40:35,527][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:40:36,323][__main__][INFO] - Iteration 554 took 24s (42.79% Gen, 53.91% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 35m 47s. Estimated total time: 20h 9m 34s. Time estimates for 10 more iterations: 4m 1s, 100 more iterations: 40m 19s, 500 more iterations: 3h 21m 35s.
[2025-11-13 11:40:36,325][__main__][INFO] - Starting iteration 554.
[2025-11-13 11:40:36,328][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 55 and human policies 1.
[2025-11-13 11:40:36,329][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:40:47,357][__main__][INFO] - Number of regex retries in iteration 554: 0
[2025-11-13 11:40:47,358][__main__][INFO] - agents played in iteration 554 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:40:47,811][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:40:47,847][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:40:47,880][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:40:47,927][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:40:47,928][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:40:47,928][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:40:48,591][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:40:48,885][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:40:49,210][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:40:49,533][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:40:49,856][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:40:50,179][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:40:50,502][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:40:50,827][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:40:51,150][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:40:51,474][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:40:51,797][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:40:52,120][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:40:52,445][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:40:52,769][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:40:53,097][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:40:53,416][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:40:53,739][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:40:54,064][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:40:54,389][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:40:54,715][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:40:55,038][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:40:55,363][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:40:55,687][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:40:56,016][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:40:56,343][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:40:56,673][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:40:56,997][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:40:57,325][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:40:57,654][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:40:57,984][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:40:58,317][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:40:58,636][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:40:58,964][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:40:59,684][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:41:00,369][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:41:00,371][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:41:00,372][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:41:01,198][__main__][INFO] - Iteration 555 took 24s (44.34% Gen, 52.33% Train). Generation: 11s, Training: 13s. Estimated remaining time: 20h 9m 21s. Estimated total time: 20h 43m 32s. Time estimates for 10 more iterations: 4m 8s, 100 more iterations: 41m 27s, 500 more iterations: 3h 27m 15s.
[2025-11-13 11:41:01,200][__main__][INFO] - Starting iteration 555.
[2025-11-13 11:41:01,203][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 55 and human policies 1.
[2025-11-13 11:41:01,203][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:41:11,549][__main__][INFO] - Number of regex retries in iteration 555: 0
[2025-11-13 11:41:11,549][__main__][INFO] - agents played in iteration 555 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:41:11,995][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:41:12,030][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:41:12,064][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:41:12,097][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:41:12,097][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:41:12,098][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:41:12,787][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:41:13,081][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:41:13,406][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:41:13,729][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:41:14,053][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:41:14,376][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:41:14,700][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:41:15,023][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:41:15,347][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:41:15,671][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:41:15,994][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:41:16,317][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:41:16,641][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:41:16,965][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:41:17,291][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:41:17,614][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:41:17,937][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:41:18,262][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:41:18,596][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:41:18,919][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:41:19,243][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:41:19,566][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:41:19,890][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:41:20,212][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:41:20,535][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:41:20,859][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:41:21,184][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:41:21,508][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:41:21,833][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:41:22,157][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:41:22,483][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:41:22,810][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:41:23,137][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:41:23,859][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:41:24,534][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:41:24,535][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:41:24,537][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:41:25,383][__main__][INFO] - Iteration 556 took 24s (42.78% Gen, 53.71% Train). Generation: 10s, Training: 12s. Estimated remaining time: 19h 34m 29s. Estimated total time: 20h 9m 4s. Time estimates for 10 more iterations: 4m 1s, 100 more iterations: 40m 18s, 500 more iterations: 3h 21m 30s.
[2025-11-13 11:41:25,385][__main__][INFO] - Starting iteration 556.
[2025-11-13 11:41:25,388][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 55 and human policies 1.
[2025-11-13 11:41:25,389][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:41:35,988][__main__][INFO] - Number of regex retries in iteration 556: 0
[2025-11-13 11:41:35,988][__main__][INFO] - agents played in iteration 556 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:41:36,434][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:41:36,467][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:41:36,500][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:41:36,533][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:41:36,534][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:41:36,534][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:41:37,227][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:41:37,521][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:41:37,845][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:41:38,168][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:41:38,503][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:41:38,828][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:41:39,155][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:41:39,478][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:41:39,802][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:41:40,130][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:41:40,454][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:41:40,777][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:41:41,106][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:41:41,434][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:41:41,760][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:41:42,086][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:41:42,413][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:41:42,737][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:41:43,059][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:41:43,382][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:41:43,707][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:41:44,030][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:41:44,354][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:41:44,680][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:41:45,005][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:41:45,330][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:41:45,657][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:41:45,979][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:41:46,304][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:41:46,629][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:41:46,955][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:41:47,286][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:41:47,611][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:41:48,332][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:41:49,052][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:41:49,056][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:41:49,057][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:41:49,837][__main__][INFO] - Iteration 557 took 24s (43.35% Gen, 53.45% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 47m 30s. Estimated total time: 20h 22m 30s. Time estimates for 10 more iterations: 4m 4s, 100 more iterations: 40m 45s, 500 more iterations: 3h 23m 45s.
[2025-11-13 11:41:49,839][__main__][INFO] - Starting iteration 557.
[2025-11-13 11:41:49,842][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 55 and human policies 1.
[2025-11-13 11:41:49,842][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:42:00,135][__main__][INFO] - Number of regex retries in iteration 557: 0
[2025-11-13 11:42:00,135][__main__][INFO] - agents played in iteration 557 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:42:00,587][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:42:00,623][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:42:00,656][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:42:00,690][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:42:00,690][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:42:00,690][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:42:01,417][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:42:01,713][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:42:02,038][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:42:02,367][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:42:02,692][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:42:03,016][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:42:03,340][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:42:03,662][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:42:03,985][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:42:04,308][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:42:04,634][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:42:04,957][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:42:05,282][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:42:05,606][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:42:05,930][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:42:06,255][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:42:06,579][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:42:06,903][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:42:07,229][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:42:07,551][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:42:07,874][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:42:08,198][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:42:08,528][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:42:08,847][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:42:09,170][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:42:09,493][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:42:09,817][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:42:10,143][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:42:10,467][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:42:10,792][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:42:11,125][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:42:11,442][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:42:11,769][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:42:12,523][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:42:13,201][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:42:13,202][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:42:13,204][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:42:14,076][__main__][INFO] - Iteration 558 took 24s (42.47% Gen, 53.93% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 36m 21s. Estimated total time: 20h 11m 45s. Time estimates for 10 more iterations: 4m 2s, 100 more iterations: 40m 23s, 500 more iterations: 3h 21m 57s.
[2025-11-13 11:42:14,078][__main__][INFO] - Starting iteration 558.
[2025-11-13 11:42:14,081][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 55 and human policies 1.
[2025-11-13 11:42:14,081][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:42:24,340][__main__][INFO] - Number of regex retries in iteration 558: 0
[2025-11-13 11:42:24,341][__main__][INFO] - agents played in iteration 558 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:42:24,781][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:42:24,818][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:42:24,851][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:42:24,884][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:42:24,885][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:42:24,886][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:42:25,564][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:42:25,858][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:42:26,182][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:42:26,505][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:42:26,828][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:42:27,151][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:42:27,477][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:42:27,800][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:42:28,125][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:42:28,448][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:42:28,774][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:42:29,096][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:42:29,422][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:42:29,745][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:42:30,069][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:42:30,393][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:42:30,717][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:42:31,043][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:42:31,368][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:42:31,692][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:42:32,018][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:42:32,342][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:42:32,667][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:42:32,992][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:42:33,320][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:42:33,644][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:42:33,974][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:42:34,299][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:42:34,624][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:42:34,949][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:42:35,275][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:42:35,599][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:42:35,925][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:42:36,670][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:42:37,350][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:42:37,351][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:42:37,352][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:42:38,147][__main__][INFO] - Iteration 559 took 24s (42.63% Gen, 54.06% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 27m 33s. Estimated total time: 20h 3m 22s. Time estimates for 10 more iterations: 4m 0s, 100 more iterations: 40m 6s, 500 more iterations: 3h 20m 33s.
[2025-11-13 11:42:38,149][__main__][INFO] - Starting iteration 559.
[2025-11-13 11:42:38,152][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 55 and human policies 1.
[2025-11-13 11:42:38,152][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:42:48,472][__main__][INFO] - Number of regex retries in iteration 559: 0 [2025-11-13 11:42:48,473][__main__][INFO] - agents played in iteration 559 are Alice, Bob, Alice_buffer, Bob_buffer [2025-11-13 11:42:48,909][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:42:48,942][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:42:48,975][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:42:49,008][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:42:49,008][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:42:49,008][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:42:49,685][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:42:49,980][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:42:50,308][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:42:50,634][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:42:50,957][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:42:51,282][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:42:51,605][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:42:51,937][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:42:52,256][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:42:52,579][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:42:52,904][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:42:53,233][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:42:53,551][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:42:53,875][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:42:54,198][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:42:54,531][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:42:54,849][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:42:55,172][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:42:55,497][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:42:55,829][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:42:56,146][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:42:56,471][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:42:56,793][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:42:57,117][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:42:57,446][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:42:57,770][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:42:58,093][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:42:58,418][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:42:58,741][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:42:59,065][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:42:59,390][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:42:59,716][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:43:00,040][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:43:00,758][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:43:01,482][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:43:01,483][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:43:01,484][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:43:02,269][__main__][INFO] - Iteration 560 took 24s (42.79% Gen, 53.95% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 29m 40s. Estimated total time: 20h 5m 52s. Time estimates for 10 more iterations: 4m 1s, 100 more iterations: 40m 11s, 500 more iterations: 3h 20m 58s.
[2025-11-13 11:43:02,270][__main__][INFO] - Starting iteration 560.
[2025-11-13 11:43:02,273][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 55 and human policies 1.
[2025-11-13 11:43:02,273][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:43:12,781][__main__][INFO] - Number of regex retries in iteration 560: 0
[2025-11-13 11:43:12,781][__main__][INFO] - agents played in iteration 560 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:43:13,218][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:43:13,253][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:43:13,286][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:43:13,319][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:43:13,319][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:43:13,320][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:43:13,992][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:43:14,287][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:43:14,613][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:43:14,936][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:43:15,259][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:43:15,581][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:43:15,908][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:43:16,231][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:43:16,554][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:43:16,878][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:43:17,204][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:43:17,527][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:43:17,850][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:43:18,172][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:43:18,505][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:43:18,828][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:43:19,153][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:43:19,478][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:43:19,810][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:43:20,135][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:43:20,459][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:43:20,784][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:43:21,111][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:43:21,437][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:43:21,765][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:43:22,090][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:43:22,417][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:43:22,742][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:43:23,069][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:43:23,394][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:43:23,727][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:43:24,052][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:43:24,378][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:43:25,099][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:43:25,771][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:43:25,773][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:43:25,774][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:43:27,280][__main__][INFO] - Iteration 561 took 25s (42.02% Gen, 51.95% Train). Generation: 10s, Training: 12s. Estimated remaining time: 20h 13m 46s. Estimated total time: 20h 50m 24s. Time estimates for 10 more iterations: 4m 10s, 100 more iterations: 41m 40s, 500 more iterations: 3h 28m 24s.
[2025-11-13 11:43:27,282][__main__][INFO] - Starting iteration 561.
[2025-11-13 11:43:27,285][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 56 and human policies 1.
[2025-11-13 11:43:27,286][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:43:37,434][__main__][INFO] - Number of regex retries in iteration 561: 0
[2025-11-13 11:43:37,435][__main__][INFO] - agents played in iteration 561 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:43:37,872][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:43:37,910][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:43:37,943][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:43:37,978][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:43:37,978][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:43:37,979][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:43:38,680][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:43:38,975][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:43:39,300][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:43:39,634][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:43:39,958][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:43:40,283][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:43:40,613][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:43:40,940][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:43:41,270][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:43:41,595][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:43:41,919][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:43:42,246][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:43:42,569][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:43:42,893][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:43:43,217][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:43:43,540][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:43:43,865][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:43:44,189][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:43:44,514][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:43:44,839][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:43:45,163][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:43:45,488][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:43:45,813][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:43:46,138][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:43:46,469][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:43:46,793][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:43:47,117][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:43:47,444][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:43:47,769][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:43:48,094][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:43:48,419][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:43:48,747][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:43:49,072][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:43:49,796][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:43:50,484][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:43:50,486][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:43:50,487][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:43:51,300][__main__][INFO] - Iteration 562 took 24s (42.26% Gen, 54.35% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 23m 46s. Estimated total time: 20h 0m 48s. Time estimates for 10 more iterations: 4m 0s, 100 more iterations: 40m 1s, 500 more iterations: 3h 20m 8s.
[2025-11-13 11:43:51,302][__main__][INFO] - Starting iteration 562.
[2025-11-13 11:43:51,305][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 56 and human policies 1.
[2025-11-13 11:43:51,305][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:44:01,413][__main__][INFO] - Number of regex retries in iteration 562: 0
[2025-11-13 11:44:01,414][__main__][INFO] - agents played in iteration 562 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:44:01,858][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:44:01,896][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:44:01,930][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:44:01,963][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:44:01,964][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:44:01,964][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:44:02,633][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:44:02,928][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:44:03,249][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:44:03,572][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:44:03,896][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:44:04,221][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:44:04,552][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:44:04,872][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:44:05,198][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:44:05,524][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:44:05,853][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:44:06,171][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:44:06,496][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:44:06,823][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:44:07,153][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:44:07,469][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:44:07,793][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:44:08,116][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:44:08,442][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:44:08,768][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:44:09,094][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:44:09,421][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:44:09,753][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:44:10,072][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:44:10,398][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:44:10,725][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:44:11,057][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:44:11,375][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:44:11,699][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:44:12,025][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:44:12,356][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:44:12,673][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:44:12,999][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:44:13,714][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:44:14,399][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:44:14,401][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:44:14,403][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:44:15,204][__main__][INFO] - Iteration 563 took 23s (42.29% Gen, 54.35% Train). Generation: 10s, Training: 12s. Estimated remaining time: 19h 17m 34s. Estimated total time: 19h 54m 59s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 49s, 500 more iterations: 3h 19m 9s.
[2025-11-13 11:44:15,206][__main__][INFO] - Starting iteration 563.
[2025-11-13 11:44:15,208][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 56 and human policies 1.
[2025-11-13 11:44:15,209][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:44:25,030][__main__][INFO] - Number of regex retries in iteration 563: 0
[2025-11-13 11:44:25,031][__main__][INFO] - agents played in iteration 563 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:44:25,449][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:44:25,482][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:44:25,517][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:44:25,568][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:44:25,568][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:44:25,568][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:44:26,246][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:44:26,542][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:44:26,866][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:44:27,190][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:44:27,513][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:44:27,837][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:44:28,162][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:44:28,489][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:44:28,815][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:44:29,140][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:44:29,467][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:44:29,794][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:44:30,122][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:44:30,446][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:44:30,772][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:44:31,094][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:44:31,417][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:44:31,740][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:44:32,063][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:44:32,388][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:44:32,712][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:44:33,042][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:44:33,366][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:44:33,689][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:44:34,013][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:44:34,336][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:44:34,661][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:44:34,985][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:44:35,315][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:44:35,640][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:44:35,964][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:44:36,288][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:44:36,613][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:44:37,335][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:44:38,054][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:44:38,055][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:44:38,056][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:44:38,908][__main__][INFO] - Iteration 564 took 23s (41.44% Gen, 54.96% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 7m 11s. Estimated total time: 19h 45m 0s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 30s, 500 more iterations: 3h 17m 30s.
[2025-11-13 11:44:38,909][__main__][INFO] - Starting iteration 564.
[2025-11-13 11:44:38,912][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 56 and human policies 1.
[2025-11-13 11:44:38,912][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:44:48,454][__main__][INFO] - Number of regex retries in iteration 564: 0
[2025-11-13 11:44:48,455][__main__][INFO] - agents played in iteration 564 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:44:48,873][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:44:48,906][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:44:48,939][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:44:48,972][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:44:48,972][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:44:48,973][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:44:49,648][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:44:49,941][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:44:50,265][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:44:50,588][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:44:50,912][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:44:51,235][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:44:51,559][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:44:51,881][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:44:52,208][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:44:52,533][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:44:52,859][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:44:53,182][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:44:53,506][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:44:53,832][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:44:54,157][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:44:54,486][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:44:54,813][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:44:55,137][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:44:55,461][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:44:55,785][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:44:56,112][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:44:56,435][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:44:56,761][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:44:57,088][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:44:57,417][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:44:57,752][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:44:58,076][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:44:58,399][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:44:58,724][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:44:59,054][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:44:59,378][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:44:59,702][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:45:00,027][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:45:00,764][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:45:01,460][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:45:01,462][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:45:01,464][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:45:02,238][__main__][INFO] - Iteration 565 took 23s (40.90% Gen, 55.77% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 48m 8s. Estimated total time: 19h 26m 21s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 52s, 500 more iterations: 3h 14m 23s.
[2025-11-13 11:45:02,240][__main__][INFO] - Starting iteration 565.
[2025-11-13 11:45:02,243][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 56 and human policies 1.
[2025-11-13 11:45:02,243][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:45:11,765][__main__][INFO] - Number of regex retries in iteration 565: 0
[2025-11-13 11:45:11,765][__main__][INFO] - agents played in iteration 565 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:45:12,200][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:45:12,234][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:45:12,267][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:45:12,300][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:45:12,300][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:45:12,300][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:45:12,971][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:45:13,266][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:45:13,590][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:45:13,912][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:45:14,236][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:45:14,572][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:45:14,887][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:45:15,211][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:45:15,535][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:45:15,858][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:45:16,181][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:45:16,508][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:45:16,831][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:45:17,161][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:45:17,479][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:45:17,808][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:45:18,132][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:45:18,461][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:45:18,783][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:45:19,109][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:45:19,434][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:45:19,766][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:45:20,082][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:45:20,407][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:45:20,733][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:45:21,063][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:45:21,382][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:45:21,707][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:45:22,032][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:45:22,363][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:45:22,681][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:45:23,007][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:45:23,332][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:45:24,070][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:45:24,792][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:45:24,793][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:45:24,795][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:45:25,836][__main__][INFO] - Iteration 566 took 23s (40.35% Gen, 55.23% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 1m 6s. Estimated total time: 19h 39m 42s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 19s, 500 more iterations: 3h 16m 37s.
[2025-11-13 11:45:25,838][__main__][INFO] - Starting iteration 566.
[2025-11-13 11:45:25,841][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 56 and human policies 1.
[2025-11-13 11:45:25,841][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:45:35,577][__main__][INFO] - Number of regex retries in iteration 566: 0
[2025-11-13 11:45:35,578][__main__][INFO] - agents played in iteration 566 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:45:36,016][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:45:36,050][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:45:36,084][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:45:36,118][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:45:36,118][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:45:36,119][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:45:36,797][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:45:37,092][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:45:37,415][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:45:37,738][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:45:38,067][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:45:38,382][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:45:38,705][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:45:39,028][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:45:39,352][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:45:39,676][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:45:39,999][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:45:40,324][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:45:40,649][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:45:40,973][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:45:41,298][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:45:41,620][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:45:41,944][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:45:42,267][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:45:42,593][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:45:42,918][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:45:43,242][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:45:43,565][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:45:43,890][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:45:44,214][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:45:44,537][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:45:44,872][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:45:45,190][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:45:45,513][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:45:45,838][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:45:46,162][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:45:46,487][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:45:46,811][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:45:47,135][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:45:47,873][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:45:48,545][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:45:48,546][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:45:48,547][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:45:49,350][__main__][INFO] - Iteration 567 took 23s (41.42% Gen, 55.16% Train). Generation: 9s, Training: 12s. Estimated remaining time: 18h 56m 32s. Estimated total time: 19h 35m 32s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 11s, 500 more iterations: 3h 15m 55s.
[2025-11-13 11:45:49,352][__main__][INFO] - Starting iteration 567.
[2025-11-13 11:45:49,355][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 56 and human policies 1.
[2025-11-13 11:45:49,356][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:45:58,898][__main__][INFO] - Number of regex retries in iteration 567: 0
[2025-11-13 11:45:58,899][__main__][INFO] - agents played in iteration 567 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:45:59,330][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:45:59,366][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:45:59,400][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:45:59,433][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:45:59,434][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:45:59,434][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:46:00,130][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:46:00,425][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:46:00,751][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:46:01,074][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:46:01,400][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:46:01,721][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:46:02,047][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:46:02,371][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:46:02,695][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:46:03,018][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:46:03,346][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:46:03,668][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:46:03,993][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:46:04,315][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:46:04,639][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:46:04,962][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:46:05,285][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:46:05,608][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:46:05,931][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:46:06,265][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:46:06,589][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:46:06,914][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:46:07,237][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:46:07,568][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:46:07,894][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:46:08,218][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:46:08,542][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:46:08,866][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:46:09,190][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:46:09,515][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:46:09,841][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:46:10,165][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:46:10,490][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:46:11,224][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:46:11,905][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:46:11,907][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:46:11,908][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:46:12,725][__main__][INFO] - Iteration 568 took 23s (40.83% Gen, 55.67% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 49m 8s. Estimated total time: 19h 28m 31s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 57s, 500 more iterations: 3h 14m 45s.
[2025-11-13 11:46:12,727][__main__][INFO] - Starting iteration 568.
[2025-11-13 11:46:12,730][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 56 and human policies 1.
[2025-11-13 11:46:12,730][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:46:22,865][__main__][INFO] - Number of regex retries in iteration 568: 0
[2025-11-13 11:46:22,866][__main__][INFO] - agents played in iteration 568 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:46:23,311][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:46:23,345][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:46:23,378][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:46:23,411][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:46:23,411][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:46:23,411][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:46:24,114][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:46:24,407][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:46:24,731][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:46:25,055][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:46:25,377][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:46:25,705][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:46:26,027][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:46:26,351][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:46:26,674][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:46:27,002][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:46:27,328][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:46:27,651][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:46:27,975][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:46:28,298][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:46:28,621][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:46:28,949][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:46:29,274][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:46:29,599][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:46:29,925][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:46:30,249][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:46:30,575][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:46:30,903][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:46:31,228][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:46:31,556][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:46:31,882][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:46:32,207][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:46:32,537][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:46:32,862][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:46:33,186][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:46:33,510][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:46:33,836][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:46:34,159][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:46:34,482][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:46:35,223][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:46:35,903][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:46:35,905][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:46:35,906][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:46:36,817][__main__][INFO] - Iteration 569 took 24s (42.08% Gen, 54.14% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 24m 37s. Estimated total time: 20h 4m 24s. Time estimates for 10 more iterations: 4m 0s, 100 more iterations: 40m 8s, 500 more iterations: 3h 20m 44s.
[2025-11-13 11:46:36,819][__main__][INFO] - Starting iteration 569.
[2025-11-13 11:46:36,822][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 56 and human policies 1.
[2025-11-13 11:46:36,822][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:46:46,465][__main__][INFO] - Number of regex retries in iteration 569: 0
[2025-11-13 11:46:46,466][__main__][INFO] - agents played in iteration 569 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:46:46,912][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:46:46,946][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:46:46,979][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:46:47,011][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:46:47,012][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:46:47,013][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:46:47,727][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:46:48,024][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:46:48,348][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:46:48,671][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:46:48,994][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:46:49,316][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:46:49,639][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:46:49,961][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:46:50,288][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:46:50,609][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:46:50,933][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:46:51,256][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:46:51,582][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:46:51,905][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:46:52,228][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:46:52,553][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:46:52,877][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:46:53,200][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:46:53,523][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:46:53,846][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:46:54,170][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:46:54,493][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:46:54,817][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:46:55,142][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:46:55,471][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:46:55,797][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:46:56,124][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:46:56,448][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:46:56,774][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:46:57,098][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:46:57,423][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:46:57,748][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:46:58,073][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:46:58,814][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:46:59,509][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:46:59,510][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:46:59,511][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:47:00,406][__main__][INFO] - Iteration 570 took 23s (40.89% Gen, 55.31% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 59m 5s. Estimated total time: 19h 39m 16s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 18s, 500 more iterations: 3h 16m 32s.
[2025-11-13 11:47:00,408][__main__][INFO] - Starting iteration 570.
[2025-11-13 11:47:00,411][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 56 and human policies 1.
[2025-11-13 11:47:00,412][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:47:10,741][__main__][INFO] - Number of regex retries in iteration 570: 0
[2025-11-13 11:47:10,742][__main__][INFO] - agents played in iteration 570 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:47:11,182][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:47:11,217][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:47:11,250][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:47:11,283][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:47:11,283][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:47:11,283][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:47:12,009][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:47:12,306][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:47:12,634][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:47:12,960][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:47:13,281][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:47:13,604][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:47:13,928][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:47:14,252][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:47:14,574][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:47:14,897][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:47:15,221][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:47:15,545][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:47:15,868][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:47:16,192][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:47:16,516][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:47:16,838][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:47:17,162][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:47:17,487][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:47:17,815][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:47:18,139][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:47:18,469][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:47:18,794][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:47:19,121][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:47:19,446][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:47:19,773][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:47:20,098][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:47:20,422][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:47:20,746][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:47:21,071][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:47:21,395][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:47:21,723][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:47:22,048][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:47:22,374][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:47:23,108][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:47:23,807][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:47:23,808][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:47:23,809][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:47:25,342][__main__][INFO] - Iteration 571 took 24s (41.43% Gen, 52.42% Train). Generation: 10s, Training: 13s. Estimated remaining time: 20h 5m 59s. Estimated total time: 20h 46m 34s. Time estimates for 10 more iterations: 4m 9s, 100 more iterations: 41m 33s, 500 more iterations: 3h 27m 45s.
[2025-11-13 11:47:25,344][__main__][INFO] - Starting iteration 571.
[2025-11-13 11:47:25,347][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 57 and human policies 1.
[2025-11-13 11:47:25,348][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:47:35,250][__main__][INFO] - Number of regex retries in iteration 571: 0
[2025-11-13 11:47:35,251][__main__][INFO] - agents played in iteration 571 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:47:35,699][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:47:35,732][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:47:35,764][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:47:35,797][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:47:35,797][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:47:35,797][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:47:36,481][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:47:36,777][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:47:37,102][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:47:37,427][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:47:37,751][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:47:38,074][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:47:38,398][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:47:38,725][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:47:39,052][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:47:39,378][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:47:39,718][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:47:40,047][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:47:40,377][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:47:40,700][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:47:41,024][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:47:41,352][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:47:41,677][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:47:42,000][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:47:42,324][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:47:42,648][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:47:42,973][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:47:43,297][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:47:43,624][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:47:43,949][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:47:44,273][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:47:44,601][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:47:44,925][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:47:45,250][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:47:45,573][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:47:45,897][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:47:46,222][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:47:46,547][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:47:46,871][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:47:47,588][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:47:48,272][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:47:48,273][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:47:48,275][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:47:49,078][__main__][INFO] - Iteration 572 took 23s (41.73% Gen, 54.88% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 5m 37s. Estimated total time: 19h 46m 36s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 33s, 500 more iterations: 3h 17m 46s.
[2025-11-13 11:47:49,080][__main__][INFO] - Starting iteration 572.
[2025-11-13 11:47:49,083][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 57 and human policies 1.
[2025-11-13 11:47:49,084][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:47:58,645][__main__][INFO] - Number of regex retries in iteration 572: 0
[2025-11-13 11:47:58,645][__main__][INFO] - agents played in iteration 572 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:47:59,096][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:47:59,132][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:47:59,165][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:47:59,199][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:47:59,199][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:47:59,199][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:47:59,866][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:48:00,161][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:48:00,486][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:48:00,808][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:48:01,131][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:48:01,455][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:48:01,781][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:48:02,104][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:48:02,428][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:48:02,754][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:48:03,081][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:48:03,411][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:48:03,736][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:48:04,067][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:48:04,392][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:48:04,715][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:48:05,039][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:48:05,366][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:48:05,689][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:48:06,013][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:48:06,338][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:48:06,662][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:48:06,986][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:48:07,310][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:48:07,635][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:48:07,960][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:48:08,283][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:48:08,607][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:48:08,930][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:48:09,254][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:48:09,578][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:48:09,902][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:48:10,226][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:48:10,963][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:48:11,639][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:48:11,640][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:48:11,642][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:48:12,463][__main__][INFO] - Iteration 573 took 23s (40.89% Gen, 55.59% Train). Generation: 9s, Training: 12s. Estimated remaining time: 18h 47m 38s. Estimated total time: 19h 29m 1s. Time estimates for 10 more iterations: 3m 53s, 100 more iterations: 38m 58s, 500 more iterations: 3h 14m 50s.
[2025-11-13 11:48:12,465][__main__][INFO] - Starting iteration 573.
[2025-11-13 11:48:12,467][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 57 and human policies 1.
[2025-11-13 11:48:12,467][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:48:22,792][__main__][INFO] - Number of regex retries in iteration 573: 0
[2025-11-13 11:48:22,792][__main__][INFO] - agents played in iteration 573 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:48:23,230][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:48:23,264][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:48:23,297][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:48:23,330][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:48:23,331][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:48:23,331][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:48:24,011][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:48:24,306][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:48:24,630][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:48:24,954][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:48:25,283][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:48:25,610][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:48:25,937][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:48:26,263][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:48:26,588][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:48:26,911][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:48:27,238][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:48:27,562][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:48:27,887][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:48:28,211][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:48:28,536][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:48:28,861][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:48:29,187][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:48:29,511][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:48:29,835][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:48:30,159][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:48:30,483][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:48:30,807][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:48:31,131][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:48:31,455][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:48:31,780][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:48:32,104][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:48:32,428][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:48:32,752][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:48:33,076][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:48:33,400][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:48:33,724][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:48:34,050][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:48:34,375][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:48:35,093][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:48:35,797][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:48:35,799][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:48:35,800][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:48:36,600][__main__][INFO] - Iteration 574 took 24s (42.78% Gen, 53.90% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 24m 55s. Estimated total time: 20h 6m 42s. Time estimates for 10 more iterations: 4m 1s, 100 more iterations: 40m 13s, 500 more iterations: 3h 21m 7s.
[2025-11-13 11:48:36,602][__main__][INFO] - Starting iteration 574.
[2025-11-13 11:48:36,605][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 57 and human policies 1.
[2025-11-13 11:48:36,606][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:48:45,516][__main__][INFO] - Number of regex retries in iteration 574: 0
[2025-11-13 11:48:45,516][__main__][INFO] - agents played in iteration 574 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:48:45,939][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:48:45,973][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:48:46,007][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:48:46,042][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:48:46,043][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:48:46,043][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:48:46,759][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:48:47,053][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:48:47,381][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:48:47,702][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:48:48,027][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:48:48,350][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:48:48,674][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:48:48,998][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:48:49,328][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:48:49,654][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:48:49,984][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:48:50,306][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:48:50,631][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:48:50,955][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:48:51,284][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:48:51,603][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:48:51,928][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:48:52,253][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:48:52,578][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:48:52,903][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:48:53,227][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:48:53,552][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:48:53,877][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:48:54,202][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:48:54,527][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:48:54,851][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:48:55,176][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:48:55,503][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:48:55,827][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:48:56,151][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:48:56,475][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:48:56,801][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:48:57,125][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:48:57,845][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:48:58,552][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:48:58,554][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:48:58,556][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:48:59,392][__main__][INFO] - Iteration 575 took 22s (39.10% Gen, 57.22% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 17m 14s. Estimated total time: 18h 59m 24s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 58s, 500 more iterations: 3h 9m 54s.
[2025-11-13 11:48:59,395][__main__][INFO] - Starting iteration 575.
[2025-11-13 11:48:59,399][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 57 and human policies 1.
[2025-11-13 11:48:59,400][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:49:09,107][__main__][INFO] - Number of regex retries in iteration 575: 0
[2025-11-13 11:49:09,107][__main__][INFO] - agents played in iteration 575 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:49:09,547][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:49:09,581][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:49:09,614][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:49:09,647][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:49:09,648][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:49:09,648][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:49:10,364][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:49:10,659][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:49:10,987][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:49:11,316][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:49:11,635][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:49:11,960][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:49:12,284][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:49:12,615][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:49:12,935][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:49:13,261][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:49:13,586][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:49:13,918][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:49:14,244][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:49:14,568][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:49:14,894][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:49:15,221][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:49:15,544][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:49:15,868][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:49:16,192][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:49:16,521][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:49:16,844][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:49:17,169][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:49:17,500][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:49:17,827][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:49:18,151][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:49:18,476][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:49:18,800][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:49:19,125][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:49:19,449][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:49:19,775][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:49:20,102][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:49:20,427][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:49:20,755][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:49:21,486][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:49:22,200][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:49:22,201][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:49:22,203][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:49:23,177][__main__][INFO] - Iteration 576 took 23s (40.82% Gen, 55.07% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 6m 25s. Estimated total time: 19h 48m 59s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 37s, 500 more iterations: 3h 18m 9s.
[2025-11-13 11:49:23,179][__main__][INFO] - Starting iteration 576.
[2025-11-13 11:49:23,186][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 57 and human policies 1.
[2025-11-13 11:49:23,188][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:49:32,322][__main__][INFO] - Number of regex retries in iteration 576: 0
[2025-11-13 11:49:32,322][__main__][INFO] - agents played in iteration 576 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:49:32,779][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:49:32,814][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:49:32,847][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:49:32,881][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:49:32,882][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:49:32,882][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:49:33,580][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:49:33,877][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:49:34,202][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:49:34,532][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:49:34,856][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:49:35,180][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:49:35,504][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:49:35,827][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:49:36,151][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:49:36,480][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:49:36,805][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:49:37,131][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:49:37,455][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:49:37,780][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:49:38,105][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:49:38,431][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:49:38,757][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:49:39,081][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:49:39,407][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:49:39,735][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:49:40,065][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:49:40,391][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:49:40,718][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:49:41,042][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:49:41,367][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:49:41,691][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:49:42,015][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:49:42,339][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:49:42,664][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:49:42,988][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:49:43,318][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:49:43,644][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:49:43,969][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:49:44,689][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:49:45,370][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:49:45,371][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:49:45,373][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:49:46,241][__main__][INFO] - Iteration 577 took 23s (39.60% Gen, 56.60% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 29m 59s. Estimated total time: 19h 12m 55s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 25s, 500 more iterations: 3h 12m 9s.
[2025-11-13 11:49:46,243][__main__][INFO] - Starting iteration 577.
[2025-11-13 11:49:46,247][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 57 and human policies 1.
[2025-11-13 11:49:46,247][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:49:55,706][__main__][INFO] - Number of regex retries in iteration 577: 0
[2025-11-13 11:49:55,706][__main__][INFO] - agents played in iteration 577 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:49:56,184][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:49:56,231][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:49:56,266][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:49:56,299][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:49:56,300][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:49:56,300][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:49:57,019][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:49:57,315][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:49:57,640][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:49:57,967][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:49:58,293][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:49:58,619][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:49:58,944][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:49:59,270][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:49:59,594][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:49:59,921][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:50:00,246][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:50:00,571][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:50:00,896][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:50:01,220][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:50:01,545][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:50:01,871][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:50:02,197][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:50:02,523][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:50:02,847][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:50:03,171][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:50:03,496][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:50:03,821][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:50:04,145][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:50:04,470][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:50:04,796][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:50:05,122][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:50:05,446][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:50:05,771][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:50:06,095][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:50:06,419][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:50:06,744][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:50:07,071][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:50:07,396][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:50:08,129][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:50:08,837][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:50:08,838][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:50:08,840][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:50:09,729][__main__][INFO] - Iteration 578 took 23s (40.28% Gen, 55.93% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 50m 50s. Estimated total time: 19h 34m 10s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 8s, 500 more iterations: 3h 15m 41s.
[2025-11-13 11:50:09,731][__main__][INFO] - Starting iteration 578.
[2025-11-13 11:50:09,734][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 57 and human policies 1.
[2025-11-13 11:50:09,735][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:50:19,346][__main__][INFO] - Number of regex retries in iteration 578: 0
[2025-11-13 11:50:19,347][__main__][INFO] - agents played in iteration 578 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:50:19,796][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:50:19,831][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:50:19,867][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:50:19,912][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:50:19,912][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:50:19,913][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:50:20,621][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:50:20,917][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:50:21,242][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:50:21,565][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:50:21,891][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:50:22,215][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:50:22,541][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:50:22,866][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:50:23,191][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:50:23,517][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:50:23,841][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:50:24,167][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:50:24,493][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:50:24,817][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:50:25,143][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:50:25,469][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:50:25,796][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:50:26,121][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:50:26,445][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:50:26,770][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:50:27,095][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:50:27,419][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:50:27,745][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:50:28,066][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:50:28,391][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:50:28,716][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:50:29,041][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:50:29,365][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:50:29,690][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:50:30,014][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:50:30,339][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:50:30,671][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:50:30,998][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:50:31,730][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:50:32,434][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:50:32,435][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:50:32,437][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:50:33,455][__main__][INFO] - Iteration 579 took 23s (40.52% Gen, 55.18% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 2m 20s. Estimated total time: 19h 46m 4s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 32s, 500 more iterations: 3h 17m 40s.
[2025-11-13 11:50:33,457][__main__][INFO] - Starting iteration 579.
[2025-11-13 11:50:33,459][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 57 and human policies 1.
[2025-11-13 11:50:33,460][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:50:42,716][__main__][INFO] - Number of regex retries in iteration 579: 0
[2025-11-13 11:50:42,717][__main__][INFO] - agents played in iteration 579 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:50:43,155][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:50:43,189][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:50:43,223][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:50:43,257][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:50:43,258][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:50:43,258][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:50:43,980][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:50:44,276][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:50:44,601][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:50:44,926][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:50:45,251][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:50:45,575][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:50:45,900][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:50:46,231][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:50:46,549][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:50:46,873][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:50:47,197][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:50:47,529][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:50:47,849][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:50:48,173][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:50:48,500][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:50:48,826][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:50:49,152][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:50:49,477][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:50:49,801][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:50:50,125][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:50:50,449][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:50:50,774][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:50:51,098][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:50:51,422][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:50:51,748][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:50:52,072][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:50:52,398][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:50:52,723][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:50:53,053][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:50:53,377][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:50:53,702][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:50:54,028][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:50:54,351][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:50:55,075][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:50:55,801][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:50:55,802][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:50:55,804][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:50:56,662][__main__][INFO] - Iteration 580 took 23s (39.90% Gen, 56.40% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 36m 3s. Estimated total time: 19h 20m 10s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 40s, 500 more iterations: 3h 13m 21s.
[2025-11-13 11:50:56,664][__main__][INFO] - Starting iteration 580.
[2025-11-13 11:50:56,667][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 57 and human policies 1.
[2025-11-13 11:50:56,667][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:51:05,733][__main__][INFO] - Number of regex retries in iteration 580: 0
[2025-11-13 11:51:05,734][__main__][INFO] - agents played in iteration 580 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:51:06,186][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:51:06,221][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:51:06,255][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:51:06,289][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:51:06,290][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:51:06,290][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:51:07,018][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:51:07,313][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:51:07,638][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:51:07,962][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:51:08,285][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:51:08,617][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:51:08,934][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:51:09,257][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:51:09,582][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:51:09,910][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:51:10,233][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:51:10,557][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:51:10,886][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:51:11,219][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:51:11,541][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:51:11,870][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:51:12,195][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:51:12,527][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:51:12,845][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:51:13,169][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:51:13,494][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:51:13,819][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:51:14,144][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:51:14,469][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:51:14,795][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:51:15,127][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:51:15,444][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:51:15,768][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:51:16,093][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:51:16,425][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:51:16,741][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:51:17,066][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:51:17,390][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:51:18,120][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:51:18,842][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:51:18,843][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:51:18,845][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:51:20,566][__main__][INFO] - Iteration 581 took 23s (37.93% Gen, 54.86% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 10m 29s. Estimated total time: 19h 55m 0s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 50s, 500 more iterations: 3h 19m 10s.
[2025-11-13 11:51:20,568][__main__][INFO] - Starting iteration 581.
[2025-11-13 11:51:20,571][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 58 and human policies 1.
[2025-11-13 11:51:20,571][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:51:29,943][__main__][INFO] - Number of regex retries in iteration 581: 0
[2025-11-13 11:51:29,943][__main__][INFO] - agents played in iteration 581 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:51:30,435][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:51:30,468][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:51:30,501][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:51:30,534][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:51:30,535][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:51:30,536][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:51:31,255][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:51:31,550][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:51:31,875][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:51:32,201][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:51:32,525][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:51:32,855][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:51:33,180][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:51:33,510][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:51:33,829][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:51:34,155][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:51:34,480][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:51:34,807][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:51:35,132][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:51:35,456][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:51:35,780][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:51:36,110][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:51:36,431][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:51:36,755][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:51:37,079][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:51:37,409][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:51:37,734][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:51:38,059][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:51:38,383][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:51:38,708][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:51:39,032][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:51:39,358][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:51:39,682][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:51:40,008][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:51:40,331][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:51:40,656][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:51:40,980][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:51:41,304][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:51:41,630][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:51:42,338][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:51:43,072][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:51:43,074][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:51:43,075][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:51:43,866][__main__][INFO] - Iteration 582 took 23s (40.23% Gen, 56.37% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 39m 54s. Estimated total time: 19h 24m 49s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 49s, 500 more iterations: 3h 14m 8s.
[2025-11-13 11:51:43,868][__main__][INFO] - Starting iteration 582.
[2025-11-13 11:51:43,871][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 58 and human policies 1.
[2025-11-13 11:51:43,872][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:51:53,075][__main__][INFO] - Number of regex retries in iteration 582: 0
[2025-11-13 11:51:53,076][__main__][INFO] - agents played in iteration 582 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:51:53,536][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:51:53,570][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:51:53,603][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:51:53,637][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:51:53,637][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:51:53,637][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:51:54,370][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:51:54,665][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:51:54,990][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:51:55,313][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:51:55,637][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:51:55,960][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:51:56,284][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:51:56,609][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:51:56,933][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:51:57,257][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:51:57,583][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:51:57,907][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:51:58,234][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:51:58,566][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:51:58,895][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:51:59,225][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:51:59,554][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:51:59,883][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:52:00,211][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:52:00,535][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:52:00,859][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:52:01,183][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:52:01,509][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:52:01,833][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:52:02,158][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:52:02,482][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:52:02,806][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:52:03,131][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:52:03,454][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:52:03,781][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:52:04,108][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:52:04,432][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:52:04,759][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:52:05,472][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:52:06,159][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:52:06,160][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:52:06,162][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:52:06,960][__main__][INFO] - Iteration 583 took 23s (39.86% Gen, 56.68% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 29m 12s. Estimated total time: 19h 14m 29s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 28s, 500 more iterations: 3h 12m 24s.
[2025-11-13 11:52:06,962][__main__][INFO] - Starting iteration 583.
[2025-11-13 11:52:06,965][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 58 and human policies 1.
[2025-11-13 11:52:06,965][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:52:16,409][__main__][INFO] - Number of regex retries in iteration 583: 0
[2025-11-13 11:52:16,410][__main__][INFO] - agents played in iteration 583 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:52:16,842][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:52:16,875][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:52:16,909][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:52:16,942][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:52:16,942][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:52:16,943][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:52:17,658][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:52:17,952][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:52:18,276][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:52:18,599][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:52:18,924][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:52:19,247][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:52:19,571][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:52:19,896][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:52:20,224][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:52:20,548][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:52:20,872][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:52:21,195][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:52:21,525][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:52:21,855][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:52:22,181][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:52:22,505][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:52:22,834][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:52:23,158][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:52:23,481][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:52:23,806][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:52:24,130][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:52:24,457][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:52:24,782][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:52:25,108][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:52:25,433][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:52:25,757][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:52:26,083][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:52:26,409][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:52:26,734][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:52:27,061][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:52:27,385][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:52:27,709][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:52:28,033][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:52:28,744][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:52:29,443][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:52:29,445][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:52:29,447][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:52:30,245][__main__][INFO] - Iteration 584 took 23s (40.57% Gen, 56.00% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 38m 22s. Estimated total time: 19h 24m 3s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 48s, 500 more iterations: 3h 14m 0s.
[2025-11-13 11:52:30,247][__main__][INFO] - Starting iteration 584.
[2025-11-13 11:52:30,249][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 58 and human policies 1.
[2025-11-13 11:52:30,250][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:52:40,032][__main__][INFO] - Number of regex retries in iteration 584: 0
[2025-11-13 11:52:40,032][__main__][INFO] - agents played in iteration 584 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:52:40,473][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:52:40,506][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:52:40,539][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:52:40,573][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:52:40,573][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:52:40,574][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:52:41,310][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:52:41,607][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:52:41,932][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:52:42,256][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:52:42,580][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:52:42,905][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:52:43,229][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:52:43,560][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:52:43,878][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:52:44,202][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:52:44,526][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:52:44,857][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:52:45,176][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:52:45,502][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:52:45,826][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:52:46,151][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:52:46,475][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:52:46,802][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:52:47,128][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:52:47,458][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:52:47,777][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:52:48,103][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:52:48,426][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:52:48,756][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:52:49,078][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:52:49,402][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:52:49,728][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:52:50,053][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:52:50,377][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:52:50,704][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:52:51,031][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:52:51,358][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:52:51,679][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:52:52,395][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:52:53,088][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:52:53,090][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:52:53,091][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:52:53,918][__main__][INFO] - Iteration 585 took 23s (41.33% Gen, 55.17% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 57m 25s. Estimated total time: 19h 43m 29s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 26s, 500 more iterations: 3h 17m 14s.
[2025-11-13 11:52:53,920][__main__][INFO] - Starting iteration 585.
[2025-11-13 11:52:53,924][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 58 and human policies 1.
[2025-11-13 11:52:53,924][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:53:02,921][__main__][INFO] - Number of regex retries in iteration 585: 0
[2025-11-13 11:53:02,922][__main__][INFO] - agents played in iteration 585 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:53:03,404][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:53:03,437][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:53:03,470][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:53:03,503][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:53:03,504][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:53:03,505][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:53:04,216][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:53:04,511][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:53:04,836][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:53:05,159][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:53:05,484][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:53:05,811][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:53:06,141][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:53:06,464][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:53:06,790][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:53:07,117][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:53:07,443][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:53:07,770][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:53:08,096][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:53:08,419][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:53:08,746][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:53:09,070][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:53:09,398][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:53:09,722][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:53:10,051][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:53:10,379][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:53:10,704][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:53:11,033][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:53:11,354][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:53:11,678][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:53:12,002][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:53:12,337][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:53:12,655][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:53:12,980][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:53:13,308][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:53:13,637][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:53:13,963][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:53:14,289][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:53:14,616][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:53:15,343][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:53:16,015][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:53:16,017][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:53:16,018][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:53:16,839][__main__][INFO] - Iteration 586 took 22s (39.26% Gen, 57.15% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 19m 20s. Estimated total time: 19h 5m 47s. Time estimates for 10 more iterations: 3m 49s, 100 more iterations: 38m 11s, 500 more iterations: 3h 10m 57s.
[2025-11-13 11:53:16,841][__main__][INFO] - Starting iteration 586.
[2025-11-13 11:53:16,844][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 58 and human policies 1.
[2025-11-13 11:53:16,844][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:53:26,321][__main__][INFO] - Number of regex retries in iteration 586: 0
[2025-11-13 11:53:26,322][__main__][INFO] - agents played in iteration 586 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:53:26,801][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:53:26,848][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:53:26,883][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:53:26,916][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:53:26,917][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:53:26,918][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:53:27,629][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:53:27,923][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:53:28,249][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:53:28,578][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:53:28,902][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:53:29,225][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:53:29,553][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:53:29,882][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:53:30,206][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:53:30,530][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:53:30,855][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:53:31,178][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:53:31,502][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:53:31,827][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:53:32,151][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:53:32,476][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:53:32,800][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:53:33,123][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:53:33,447][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:53:33,772][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:53:34,096][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:53:34,423][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:53:34,746][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:53:35,071][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:53:35,395][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:53:35,719][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:53:36,044][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:53:36,367][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:53:36,690][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:53:37,016][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:53:37,344][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:53:37,668][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:53:37,992][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:53:38,707][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:53:39,381][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:53:39,382][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:53:39,383][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:53:40,265][__main__][INFO] - Iteration 587 took 23s (40.46% Gen, 55.77% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 44m 17s. Estimated total time: 19h 31m 8s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 2s, 500 more iterations: 3h 15m 11s. [2025-11-13 11:53:40,267][__main__][INFO] - Starting iteration 587. [2025-11-13 11:53:40,271][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 58 and human policies 1. 
[2025-11-13 11:53:40,271][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:53:49,840][__main__][INFO] - Number of regex retries in iteration 587: 0 [2025-11-13 11:53:49,841][__main__][INFO] - agents played in iteration 587 are Alice, Bob, Alice_buffer, Bob_buffer [2025-11-13 11:53:50,276][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:53:50,309][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:53:50,342][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:53:50,375][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:53:50,376][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:53:50,376][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:53:51,092][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:53:51,388][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:53:51,712][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:53:52,047][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:53:52,374][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:53:52,698][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:53:53,024][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:53:53,351][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:53:53,676][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:53:54,003][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:53:54,328][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:53:54,660][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:53:54,984][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:53:55,310][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:53:55,635][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:53:55,964][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:53:56,289][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:53:56,614][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:53:56,938][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:53:57,265][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:53:57,589][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:53:57,916][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:53:58,240][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:53:58,568][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:53:58,898][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:53:59,226][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:53:59,552][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:53:59,879][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:54:00,206][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:54:00,531][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:54:00,858][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:54:01,186][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:54:01,511][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:54:02,217][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:54:02,917][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:54:02,918][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:54:02,919][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:54:03,799][__main__][INFO] - Iteration 588 took 23s (40.67% Gen, 55.58% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 49m 13s. Estimated total time: 19h 36m 27s. Time estimates for 10 more iterations: 3m 55s, 100 more iterations: 39m 12s, 500 more iterations: 3h 16m 4s. [2025-11-13 11:54:03,802][__main__][INFO] - Starting iteration 588. [2025-11-13 11:54:03,804][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 58 and human policies 1. 
[2025-11-13 11:54:03,805][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:54:09,569][mllm.models.large_language_model_local][WARNING] - Response <{A}> did not match regex: (|), retry 1/1 [2025-11-13 11:54:14,015][__main__][INFO] - Number of regex retries in iteration 588: 1 [2025-11-13 11:54:14,016][__main__][INFO] - agents played in iteration 588 are Alice, Bob, Alice_buffer, Bob_buffer [2025-11-13 11:54:14,466][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:54:14,499][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:54:14,532][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:54:14,565][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:54:14,566][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:54:14,566][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:54:15,279][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:54:15,583][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:54:15,901][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:54:16,225][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:54:16,549][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:54:16,873][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:54:17,196][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:54:17,521][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:54:17,846][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:54:18,170][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:54:18,494][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:54:18,819][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:54:19,143][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:54:19,468][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:54:19,793][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:54:20,118][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:54:20,443][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:54:20,770][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:54:21,096][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:54:21,421][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:54:21,748][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:54:22,077][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:54:22,398][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:54:22,721][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:54:23,046][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:54:23,377][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:54:23,696][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:54:24,020][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:54:24,343][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:54:24,674][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:54:24,992][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:54:25,317][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:54:25,643][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:54:26,363][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:54:27,081][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:54:27,082][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:54:27,084][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:54:27,940][__main__][INFO] - Iteration 589 took 24s (42.31% Gen, 54.14% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 19m 10s. Estimated total time: 20h 6m 48s. Time estimates for 10 more iterations: 4m 1s, 100 more iterations: 40m 13s, 500 more iterations: 3h 21m 8s. [2025-11-13 11:54:27,942][__main__][INFO] - Starting iteration 589. [2025-11-13 11:54:27,945][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 58 and human policies 1. 
[2025-11-13 11:54:27,946][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:54:36,692][__main__][INFO] - Number of regex retries in iteration 589: 0 [2025-11-13 11:54:36,693][__main__][INFO] - agents played in iteration 589 are Alice, Bob, Alice_buffer, Bob_buffer [2025-11-13 11:54:37,138][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:54:37,171][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:54:37,204][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:54:37,236][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:54:37,237][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:54:37,237][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:54:37,947][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:54:38,255][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:54:38,581][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:54:38,904][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:54:39,229][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:54:39,554][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:54:39,880][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:54:40,209][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:54:40,533][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:54:40,863][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:54:41,187][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:54:41,512][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:54:41,837][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:54:42,165][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:54:42,493][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:54:42,818][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:54:43,143][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:54:43,469][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:54:43,793][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:54:44,120][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:54:44,445][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:54:44,771][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:54:45,100][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:54:45,424][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:54:45,753][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:54:46,076][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:54:46,404][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:54:46,730][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:54:47,060][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:54:47,385][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:54:47,714][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:54:48,039][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:54:48,370][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:54:49,094][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:54:49,810][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:54:49,811][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:54:49,813][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:54:50,700][__main__][INFO] - Iteration 590 took 22s (38.44% Gen, 57.66% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 9m 47s. Estimated total time: 18h 57m 48s. Time estimates for 10 more iterations: 3m 47s, 100 more iterations: 37m 55s, 500 more iterations: 3h 9m 38s. [2025-11-13 11:54:50,702][__main__][INFO] - Starting iteration 590. [2025-11-13 11:54:50,706][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 58 and human policies 1. 
[2025-11-13 11:54:50,707][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:54:59,952][__main__][INFO] - Number of regex retries in iteration 590: 0 [2025-11-13 11:54:59,952][__main__][INFO] - agents played in iteration 590 are Alice, Bob, Alice_buffer, Bob_buffer [2025-11-13 11:55:00,423][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:55:00,456][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:55:00,489][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:55:00,523][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:55:00,523][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:55:00,523][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:55:01,220][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:55:01,515][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:55:01,841][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:55:02,168][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:55:02,495][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:55:02,824][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:55:03,152][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:55:03,481][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:55:03,806][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:55:04,133][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:55:04,467][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:55:04,794][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:55:05,120][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:55:05,445][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:55:05,770][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:55:06,096][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:55:06,420][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:55:06,745][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:55:07,070][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:55:07,394][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:55:07,721][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:55:08,045][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:55:08,374][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:55:08,698][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:55:09,023][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:55:09,346][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:55:09,676][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:55:10,007][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:55:10,333][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:55:10,657][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:55:10,986][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:55:11,314][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:55:11,638][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:55:12,355][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:55:13,037][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:55:13,038][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:55:13,040][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:55:14,651][__main__][INFO] - Iteration 591 took 23s (38.61% Gen, 54.65% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 8m 54s. Estimated total time: 19h 57m 18s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 54s, 500 more iterations: 3h 19m 33s. [2025-11-13 11:55:14,653][__main__][INFO] - Starting iteration 591. [2025-11-13 11:55:14,656][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 59 and human policies 1. 
[2025-11-13 11:55:14,656][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 11:55:23,716][__main__][INFO] - Number of regex retries in iteration 591: 0 [2025-11-13 11:55:23,716][__main__][INFO] - agents played in iteration 591 are Alice, Bob, Alice_buffer, Bob_buffer [2025-11-13 11:55:24,164][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:55:24,199][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:55:24,231][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:55:24,264][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 11:55:24,265][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 11:55:24,265][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 11:55:25,084][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 11:55:25,384][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 11:55:25,712][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 11:55:26,038][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 11:55:26,366][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 11:55:26,693][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 11:55:27,020][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 11:55:27,347][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 11:55:27,679][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 11:55:27,998][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 11:55:28,322][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 11:55:28,647][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 11:55:28,980][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 11:55:29,296][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 11:55:29,620][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 11:55:29,944][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 11:55:30,274][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 11:55:30,593][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 11:55:30,918][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 11:55:31,242][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 11:55:31,574][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128 [2025-11-13 11:55:31,893][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 11:55:32,218][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 11:55:32,542][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 11:55:32,876][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 11:55:33,196][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 11:55:33,521][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 11:55:33,847][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 11:55:34,182][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 11:55:34,502][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 11:55:34,827][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 11:55:35,152][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 11:55:35,476][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 11:55:36,202][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:55:36,884][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:55:36,887][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:55:36,889][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:55:37,733][__main__][INFO] - Iteration 592 took 23s (39.26% Gen, 57.08% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 25m 6s. Estimated total time: 19h 13m 54s. Time estimates for 10 more iterations: 3m 50s, 100 more iterations: 38m 27s, 500 more iterations: 3h 12m 19s.
[2025-11-13 11:55:37,735][__main__][INFO] - Starting iteration 592.
[2025-11-13 11:55:37,738][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 59 and human policies 1.
[2025-11-13 11:55:37,738][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:55:47,310][__main__][INFO] - Number of regex retries in iteration 592: 0
[2025-11-13 11:55:47,311][__main__][INFO] - agents played in iteration 592 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:55:47,775][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:55:47,813][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:55:47,846][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:55:47,879][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:55:47,880][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:55:47,880][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:55:48,609][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:55:48,904][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:55:49,230][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:55:49,560][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:55:49,884][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:55:50,208][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:55:50,533][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:55:50,867][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:55:51,197][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:55:51,526][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:55:51,858][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:55:52,184][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:55:52,509][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:55:52,833][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:55:53,169][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:55:53,494][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:55:53,820][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:55:54,146][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:55:54,473][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:55:54,798][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:55:55,123][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:55:55,448][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:55:55,774][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:55:56,100][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:55:56,425][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:55:56,750][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:55:57,073][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:55:57,398][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:55:57,723][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:55:58,048][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:55:58,377][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:55:58,701][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:55:59,027][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:55:59,752][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:56:00,446][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:56:00,447][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:56:00,449][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:56:01,373][__main__][INFO] - Iteration 593 took 23s (40.50% Gen, 55.58% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 52m 38s. Estimated total time: 19h 41m 50s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 23s, 500 more iterations: 3h 16m 58s.
[2025-11-13 11:56:01,375][__main__][INFO] - Starting iteration 593.
[2025-11-13 11:56:01,379][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 59 and human policies 1.
[2025-11-13 11:56:01,379][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:56:10,541][__main__][INFO] - Number of regex retries in iteration 593: 0
[2025-11-13 11:56:10,542][__main__][INFO] - agents played in iteration 593 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:56:10,990][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:56:11,023][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:56:11,056][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:56:11,090][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:56:11,091][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:56:11,091][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:56:11,841][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:56:12,136][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:56:12,462][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:56:12,789][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:56:13,115][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:56:13,439][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:56:13,766][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:56:14,090][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:56:14,416][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:56:14,741][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:56:15,065][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:56:15,392][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:56:15,719][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:56:16,043][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:56:16,369][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:56:16,693][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:56:17,019][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:56:17,345][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:56:17,673][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:56:18,004][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:56:18,327][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:56:18,654][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:56:18,981][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:56:19,305][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:56:19,629][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:56:19,953][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:56:20,278][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:56:20,602][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:56:20,926][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:56:21,252][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:56:21,577][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:56:21,910][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:56:22,228][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:56:22,944][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:56:23,642][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:56:23,644][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:56:23,645][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:56:24,640][__main__][INFO] - Iteration 594 took 23s (39.39% Gen, 56.33% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 33m 30s. Estimated total time: 19h 23m 5s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 46s, 500 more iterations: 3h 13m 50s.
[2025-11-13 11:56:24,642][__main__][INFO] - Starting iteration 594.
[2025-11-13 11:56:24,645][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 59 and human policies 1.
[2025-11-13 11:56:24,646][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:56:33,546][__main__][INFO] - Number of regex retries in iteration 594: 0
[2025-11-13 11:56:33,547][__main__][INFO] - agents played in iteration 594 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:56:33,995][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:56:34,030][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:56:34,066][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:56:34,112][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:56:34,113][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:56:34,113][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:56:34,830][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:56:35,124][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:56:35,449][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:56:35,774][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:56:36,098][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:56:36,425][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:56:36,750][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:56:37,073][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:56:37,396][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:56:37,720][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:56:38,044][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:56:38,369][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:56:38,693][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:56:39,017][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:56:39,344][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:56:39,669][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:56:39,994][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:56:40,319][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:56:40,644][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:56:40,973][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:56:41,297][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:56:41,624][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:56:41,948][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:56:42,272][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:56:42,597][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:56:42,921][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:56:43,246][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:56:43,570][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:56:43,893][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:56:44,217][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:56:44,542][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:56:44,866][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:56:45,191][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:56:45,919][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:56:46,609][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:56:46,610][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:56:46,612][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:56:47,517][__main__][INFO] - Iteration 595 took 22s (38.91% Gen, 57.12% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 13m 40s. Estimated total time: 19h 3m 37s. Time estimates for 10 more iterations: 3m 48s, 100 more iterations: 38m 7s, 500 more iterations: 3h 10m 36s.
[2025-11-13 11:56:47,519][__main__][INFO] - Starting iteration 595.
[2025-11-13 11:56:47,522][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 59 and human policies 1.
[2025-11-13 11:56:47,523][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:56:56,733][__main__][INFO] - Number of regex retries in iteration 595: 0
[2025-11-13 11:56:56,734][__main__][INFO] - agents played in iteration 595 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:56:57,185][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:56:57,219][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:56:57,252][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:56:57,285][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:56:57,286][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:56:57,286][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:56:57,991][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:56:58,288][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:56:58,613][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:56:58,938][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:56:59,263][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:56:59,587][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:56:59,911][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:57:00,238][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:57:00,560][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:57:00,885][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:57:01,210][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:57:01,534][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:57:01,859][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:57:02,183][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:57:02,508][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:57:02,832][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:57:03,156][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:57:03,480][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:57:03,808][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:57:04,136][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:57:04,460][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:57:04,784][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:57:05,109][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:57:05,434][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:57:05,759][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:57:06,092][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:57:06,416][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:57:06,741][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:57:07,067][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:57:07,392][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:57:07,717][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:57:08,042][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:57:08,367][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:57:09,060][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:57:09,770][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:57:09,771][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:57:09,773][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:57:10,679][__main__][INFO] - Iteration 596 took 23s (39.77% Gen, 56.30% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 27m 33s. Estimated total time: 19h 17m 54s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 35s, 500 more iterations: 3h 12m 59s.
[2025-11-13 11:57:10,681][__main__][INFO] - Starting iteration 596.
[2025-11-13 11:57:10,685][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 59 and human policies 1.
[2025-11-13 11:57:10,685][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:57:19,880][__main__][INFO] - Number of regex retries in iteration 596: 0
[2025-11-13 11:57:19,881][__main__][INFO] - agents played in iteration 596 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:57:20,335][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:57:20,369][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:57:20,402][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:57:20,436][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:57:20,436][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:57:20,437][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:57:21,171][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:57:21,469][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:57:21,805][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:57:22,129][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:57:22,453][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:57:22,778][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:57:23,107][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:57:23,432][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:57:23,757][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:57:24,082][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:57:24,412][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:57:24,738][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:57:25,062][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:57:25,388][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:57:25,715][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:57:26,040][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:57:26,364][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:57:26,690][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:57:27,016][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:57:27,341][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:57:27,669][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:57:27,994][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:57:28,319][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:57:28,643][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:57:28,967][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:57:29,291][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:57:29,616][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:57:29,940][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:57:30,265][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:57:30,592][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:57:30,924][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:57:31,245][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:57:31,568][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:57:32,276][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:57:32,983][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:57:32,984][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:57:32,986][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:57:33,909][__main__][INFO] - Iteration 597 took 23s (39.59% Gen, 56.43% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 30m 31s. Estimated total time: 19h 21m 15s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 42s, 500 more iterations: 3h 13m 32s.
[2025-11-13 11:57:33,911][__main__][INFO] - Starting iteration 597.
[2025-11-13 11:57:33,914][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 59 and human policies 1.
[2025-11-13 11:57:33,915][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:57:43,211][__main__][INFO] - Number of regex retries in iteration 597: 0
[2025-11-13 11:57:43,211][__main__][INFO] - agents played in iteration 597 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:57:43,663][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:57:43,698][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:57:43,732][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:57:43,766][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:57:43,767][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:57:43,767][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:57:44,491][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:57:44,789][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:57:45,116][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:57:45,439][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:57:45,764][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:57:46,088][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:57:46,412][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:57:46,738][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:57:47,062][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:57:47,386][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:57:47,711][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:57:48,035][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:57:48,360][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:57:48,683][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:57:49,007][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:57:49,336][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:57:49,660][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:57:49,985][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:57:50,309][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:57:50,634][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:57:50,958][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:57:51,282][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:57:51,606][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:57:51,934][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:57:52,258][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:57:52,583][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:57:52,907][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:57:53,232][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:57:53,558][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:57:53,881][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:57:54,207][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:57:54,535][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:57:54,860][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:57:55,551][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 11:57:56,283][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 11:57:56,285][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 11:57:56,287][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 11:57:57,204][__main__][INFO] - Iteration 598 took 23s (39.92% Gen, 56.14% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 33m 24s. Estimated total time: 19h 24m 31s. Time estimates for 10 more iterations: 3m 52s, 100 more iterations: 38m 49s, 500 more iterations: 3h 14m 5s. [2025-11-13 11:57:57,205][__main__][INFO] - Starting iteration 598. [2025-11-13 11:57:57,209][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 59 and human policies 1. 
[2025-11-13 11:57:57,209][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:58:06,653][__main__][INFO] - Number of regex retries in iteration 598: 0
[2025-11-13 11:58:06,654][__main__][INFO] - agents played in iteration 598 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:58:07,106][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:58:07,141][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:58:07,174][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:58:07,207][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:58:07,208][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:58:07,208][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:58:07,953][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:58:08,250][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:58:08,575][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:58:08,903][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:58:09,227][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:58:09,551][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:58:09,876][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:58:10,205][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:58:10,529][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:58:10,855][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:58:11,182][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:58:11,508][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:58:11,834][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:58:12,158][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:58:12,483][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:58:12,809][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:58:13,136][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:58:13,462][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:58:13,786][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:58:14,112][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:58:14,437][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:58:14,761][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:58:15,086][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:58:15,411][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:58:15,737][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:58:16,062][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:58:16,385][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:58:16,710][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:58:17,033][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:58:17,357][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:58:17,681][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:58:18,007][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:58:18,331][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:58:19,013][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:58:19,713][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:58:19,715][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:58:19,716][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:58:20,623][__main__][INFO] - Iteration 599 took 23s (40.33% Gen, 55.79% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 39m 15s. Estimated total time: 19h 30m 46s. Time estimates for 10 more iterations: 3m 54s, 100 more iterations: 39m 1s, 500 more iterations: 3h 15m 7s.
[2025-11-13 11:58:20,625][__main__][INFO] - Starting iteration 599.
[2025-11-13 11:58:20,629][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 59 and human policies 1.
[2025-11-13 11:58:20,629][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:58:29,923][__main__][INFO] - Number of regex retries in iteration 599: 0
[2025-11-13 11:58:29,923][__main__][INFO] - agents played in iteration 599 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:58:30,377][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:58:30,410][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:58:30,443][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:58:30,477][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:58:30,477][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:58:30,478][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:58:31,225][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:58:31,521][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:58:31,846][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:58:32,171][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:58:32,498][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:58:32,822][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:58:33,147][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:58:33,471][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:58:33,795][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:58:34,123][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:58:34,446][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:58:34,770][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:58:35,100][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:58:35,426][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:58:35,756][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:58:36,081][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:58:36,413][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:58:36,739][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:58:37,064][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:58:37,388][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:58:37,724][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:58:38,050][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:58:38,375][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:58:38,703][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:58:39,027][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:58:39,353][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:58:39,676][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:58:40,002][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:58:40,332][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:58:40,660][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:58:40,986][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:58:41,316][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:58:41,643][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:58:42,320][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:58:43,044][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:58:43,046][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:58:43,047][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:58:44,246][__main__][INFO] - Iteration 600 took 23s (39.35% Gen, 55.57% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 48m 59s. Estimated total time: 19h 40m 54s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 21s, 500 more iterations: 3h 16m 49s.
[2025-11-13 11:58:44,248][__main__][INFO] - Starting iteration 600.
[2025-11-13 11:58:44,251][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 59 and human policies 1.
[2025-11-13 11:58:44,252][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:58:53,174][__main__][INFO] - Number of regex retries in iteration 600: 0
[2025-11-13 11:58:53,174][__main__][INFO] - agents played in iteration 600 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:58:53,627][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:58:53,663][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:58:53,696][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:58:53,729][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:58:53,730][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:58:53,730][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:58:54,475][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:58:54,771][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:58:55,096][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:58:55,421][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:58:55,749][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:58:56,073][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:58:56,396][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:58:56,720][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:58:57,046][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:58:57,370][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:58:57,694][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:58:58,019][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:58:58,346][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:58:58,675][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:58:59,000][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:58:59,323][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:58:59,649][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:58:59,975][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:59:00,299][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:59:00,623][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:59:00,951][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:59:01,275][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:59:01,601][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:59:01,927][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:59:02,253][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:59:02,578][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:59:02,903][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:59:03,227][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:59:03,555][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:59:03,881][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:59:04,207][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:59:04,533][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:59:04,859][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:59:05,551][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:59:06,256][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:59:06,257][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:59:06,259][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:59:08,498][__main__][INFO] - Iteration 601 took 24s (36.80% Gen, 53.96% Train). Generation: 8s, Training: 13s. Estimated remaining time: 19h 20m 5s. Estimated total time: 20h 12m 23s. Time estimates for 10 more iterations: 4m 2s, 100 more iterations: 40m 24s, 500 more iterations: 3h 22m 3s.
[2025-11-13 11:59:08,501][__main__][INFO] - Starting iteration 601.
[2025-11-13 11:59:08,504][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 60 and human policies 1.
[2025-11-13 11:59:08,504][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:59:17,672][__main__][INFO] - Number of regex retries in iteration 601: 0
[2025-11-13 11:59:17,673][__main__][INFO] - agents played in iteration 601 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:59:18,143][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:59:18,176][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:59:18,209][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:59:18,242][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:59:18,243][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:59:18,243][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:59:18,972][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:59:19,268][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:59:19,592][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:59:19,916][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:59:20,247][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:59:20,565][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:59:20,890][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:59:21,213][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:59:21,547][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:59:21,866][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:59:22,190][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:59:22,516][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:59:22,848][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:59:23,165][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:59:23,489][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:59:23,813][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:59:24,145][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:59:24,463][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:59:24,789][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:59:25,113][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:59:25,445][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:59:25,762][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:59:26,089][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:59:26,413][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:59:26,744][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:59:27,066][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:59:27,390][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:59:27,714][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:59:28,044][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:59:28,362][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:59:28,686][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:59:29,010][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:59:29,335][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:59:30,049][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:59:30,766][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:59:30,768][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:59:30,770][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:59:31,670][__main__][INFO] - Iteration 602 took 23s (39.57% Gen, 56.53% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 25m 39s. Estimated total time: 19h 18m 21s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 36s, 500 more iterations: 3h 13m 3s.
[2025-11-13 11:59:31,672][__main__][INFO] - Starting iteration 602.
[2025-11-13 11:59:31,675][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 60 and human policies 1.
[2025-11-13 11:59:31,675][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 11:59:41,261][__main__][INFO] - Number of regex retries in iteration 602: 0
[2025-11-13 11:59:41,262][__main__][INFO] - agents played in iteration 602 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 11:59:41,717][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:59:41,750][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:59:41,784][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:59:41,818][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 11:59:41,818][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 11:59:41,819][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 11:59:42,560][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 11:59:42,854][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 11:59:43,179][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 11:59:43,502][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 11:59:43,826][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 11:59:44,152][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 11:59:44,475][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 11:59:44,799][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 11:59:45,129][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 11:59:45,457][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 11:59:45,781][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 11:59:46,104][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 11:59:46,431][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 11:59:46,755][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 11:59:47,080][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 11:59:47,404][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 11:59:47,730][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 11:59:48,055][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 11:59:48,378][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 11:59:48,707][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 11:59:49,033][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 11:59:49,358][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 11:59:49,684][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 11:59:50,008][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 11:59:50,337][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 11:59:50,663][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 11:59:50,985][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 11:59:51,310][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 11:59:51,637][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 11:59:51,963][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 11:59:52,285][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 11:59:52,611][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 11:59:52,934][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 11:59:53,636][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 11:59:54,356][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 11:59:54,357][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 11:59:54,359][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 11:59:55,583][__main__][INFO] - Iteration 603 took 23s (40.10% Gen, 54.78% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 2m 21s. Estimated total time: 19h 55m 27s. Time estimates for 10 more iterations: 3m 59s, 100 more iterations: 39m 50s, 500 more iterations: 3h 19m 14s.
[2025-11-13 11:59:55,585][__main__][INFO] - Starting iteration 603.
[2025-11-13 11:59:55,589][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 60 and human policies 1.
[2025-11-13 11:59:55,589][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 12:00:04,516][__main__][INFO] - Number of regex retries in iteration 603: 0 [2025-11-13 12:00:04,517][__main__][INFO] - agents played in iteration 603 are Alice, Bob, Alice_buffer, Bob_buffer [2025-11-13 12:00:04,977][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:00:05,011][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:00:05,046][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:00:05,079][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:00:05,079][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 12:00:05,080][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 12:00:05,815][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 12:00:06,112][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 12:00:06,436][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 12:00:06,759][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 12:00:07,084][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 12:00:07,410][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 12:00:07,735][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 12:00:08,059][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 12:00:08,384][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 12:00:08,709][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 12:00:09,033][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 12:00:09,358][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 12:00:09,683][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 12:00:10,014][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 12:00:10,338][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 12:00:10,662][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 12:00:10,986][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 12:00:11,318][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 12:00:11,642][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 12:00:11,967][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 12:00:12,292][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 12:00:12,626][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 12:00:12,950][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 12:00:13,278][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 12:00:13,601][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 12:00:13,927][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 12:00:14,251][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 12:00:14,576][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 12:00:14,900][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 12:00:15,229][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 12:00:15,553][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 12:00:15,878][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 12:00:16,202][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 12:00:16,925][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 12:00:17,659][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 12:00:17,661][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 12:00:17,663][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 12:00:19,286][__main__][INFO] - Iteration 604 took 23s (37.67% Gen, 55.47% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 51m 26s. Estimated total time: 19h 44m 55s. Time estimates for 10 more iterations: 3m 56s, 100 more iterations: 39m 29s, 500 more iterations: 3h 17m 29s. [2025-11-13 12:00:19,289][__main__][INFO] - Starting iteration 604. [2025-11-13 12:00:19,292][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 60 and human policies 1. 
[2025-11-13 12:00:19,293][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 12:00:28,828][__main__][INFO] - Number of regex retries in iteration 604: 0 [2025-11-13 12:00:28,829][__main__][INFO] - agents played in iteration 604 are Alice, Bob, Alice_buffer, Bob_buffer [2025-11-13 12:00:29,283][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:00:29,320][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:00:29,368][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:00:29,403][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:00:29,403][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 12:00:29,403][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 12:00:30,133][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 12:00:30,429][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 12:00:30,754][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 12:00:31,077][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 12:00:31,401][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 12:00:31,724][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 12:00:32,056][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 12:00:32,373][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 12:00:32,698][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 12:00:33,022][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 12:00:33,352][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 12:00:33,668][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 12:00:33,995][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 12:00:34,319][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 12:00:34,651][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 12:00:34,970][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 12:00:35,295][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 12:00:35,619][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 12:00:35,950][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 12:00:36,267][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 12:00:36,590][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 12:00:36,915][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 12:00:37,245][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 12:00:37,564][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 12:00:37,888][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 12:00:38,211][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 12:00:38,535][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 12:00:38,859][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 12:00:39,188][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 12:00:39,514][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 12:00:39,843][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 12:00:40,168][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 12:00:40,492][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 12:00:41,197][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 12:00:41,901][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 12:00:41,902][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 12:00:41,904][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 12:00:43,097][__main__][INFO] - Iteration 605 took 23s (40.06% Gen, 54.93% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 56m 23s. Estimated total time: 19h 50m 16s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 40s, 500 more iterations: 3h 18m 22s. [2025-11-13 12:00:43,099][__main__][INFO] - Starting iteration 605. [2025-11-13 12:00:43,102][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 60 and human policies 1. 
[2025-11-13 12:00:43,103][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 12:00:52,695][__main__][INFO] - Number of regex retries in iteration 605: 0 [2025-11-13 12:00:52,696][__main__][INFO] - agents played in iteration 605 are Alice, Bob, Alice_buffer, Bob_buffer [2025-11-13 12:00:53,164][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:00:53,197][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:00:53,230][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:00:53,264][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:00:53,265][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 12:00:53,265][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 12:00:53,979][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 12:00:54,273][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 12:00:54,597][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 12:00:54,922][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 12:00:55,247][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 12:00:55,571][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 12:00:55,895][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 12:00:56,218][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 12:00:56,543][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 12:00:56,867][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 12:00:57,192][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 12:00:57,521][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 12:00:57,847][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 12:00:58,171][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 12:00:58,497][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 12:00:58,821][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 12:00:59,148][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 12:00:59,474][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 12:00:59,798][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 12:01:00,121][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 12:01:00,444][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 12:01:00,767][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 12:01:01,091][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 12:01:01,417][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 12:01:01,740][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 12:01:02,063][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 12:01:02,386][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 12:01:02,709][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 12:01:03,032][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 12:01:03,357][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 12:01:03,682][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 12:01:04,006][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 12:01:04,332][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 12:01:05,054][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 12:01:05,760][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 12:01:05,761][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 12:01:05,763][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 12:01:06,961][__main__][INFO] - Iteration 606 took 23s (40.21% Gen, 54.77% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 58m 40s. Estimated total time: 19h 52m 57s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 45s, 500 more iterations: 3h 18m 49s. [2025-11-13 12:01:06,962][__main__][INFO] - Starting iteration 606. [2025-11-13 12:01:06,966][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 60 and human policies 1. 
[2025-11-13 12:01:06,966][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 12:01:16,677][__main__][INFO] - Number of regex retries in iteration 606: 0 [2025-11-13 12:01:16,677][__main__][INFO] - agents played in iteration 606 are Alice, Bob, Alice_buffer, Bob_buffer [2025-11-13 12:01:17,158][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:01:17,192][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:01:17,225][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:01:17,260][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:01:17,260][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 12:01:17,261][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 12:01:17,986][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 12:01:18,281][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 12:01:18,606][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 12:01:18,930][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 12:01:19,253][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 12:01:19,583][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 12:01:19,903][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 12:01:20,226][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 12:01:20,551][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 12:01:20,879][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 12:01:21,200][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 12:01:21,524][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 12:01:21,848][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 12:01:22,173][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 12:01:22,501][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 12:01:22,824][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 12:01:23,148][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 12:01:23,476][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 12:01:23,795][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 12:01:24,118][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 12:01:24,443][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 12:01:24,773][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 12:01:25,092][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 12:01:25,415][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 12:01:25,739][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 12:01:26,071][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 12:01:26,393][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 12:01:26,722][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 12:01:27,048][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 12:01:27,375][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 12:01:27,701][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 12:01:28,026][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 12:01:28,351][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 12:01:29,054][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 12:01:29,756][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 12:01:29,758][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 12:01:29,760][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 12:01:31,266][__main__][INFO] - Iteration 607 took 24s (39.96% Gen, 53.84% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 20m 21s. Estimated total time: 20h 15m 2s. Time estimates for 10 more iterations: 4m 3s, 100 more iterations: 40m 30s, 500 more iterations: 3h 22m 30s. [2025-11-13 12:01:31,268][__main__][INFO] - Starting iteration 607. [2025-11-13 12:01:31,271][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 60 and human policies 1. 
[2025-11-13 12:01:31,271][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 12:01:40,720][__main__][INFO] - Number of regex retries in iteration 607: 0 [2025-11-13 12:01:40,720][__main__][INFO] - agents played in iteration 607 are Alice, Bob, Alice_buffer, Bob_buffer [2025-11-13 12:01:41,182][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:01:41,216][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:01:41,250][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:01:41,283][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:01:41,284][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 12:01:41,285][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 12:01:42,036][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128 [2025-11-13 12:01:42,341][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128 [2025-11-13 12:01:42,666][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128 [2025-11-13 12:01:42,990][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128 [2025-11-13 12:01:43,314][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128 [2025-11-13 12:01:43,648][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128 [2025-11-13 12:01:43,972][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128 [2025-11-13 12:01:44,297][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128 [2025-11-13 12:01:44,622][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128 [2025-11-13 12:01:44,957][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128 [2025-11-13 12:01:45,284][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128 [2025-11-13 12:01:45,613][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128 [2025-11-13 12:01:45,937][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128 [2025-11-13 12:01:46,265][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128 [2025-11-13 12:01:46,590][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128 [2025-11-13 12:01:46,916][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128 [2025-11-13 12:01:47,239][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128 [2025-11-13 12:01:47,567][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128 [2025-11-13 12:01:47,893][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128 [2025-11-13 12:01:48,217][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128 [2025-11-13 12:01:48,540][mllm.training.trainer_common][INFO] - Processing 
mini-batch 80 of 128 [2025-11-13 12:01:48,869][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128 [2025-11-13 12:01:49,198][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128 [2025-11-13 12:01:49,524][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128 [2025-11-13 12:01:49,848][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128 [2025-11-13 12:01:50,176][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128 [2025-11-13 12:01:50,501][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128 [2025-11-13 12:01:50,830][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128 [2025-11-13 12:01:51,156][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128 [2025-11-13 12:01:51,484][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128 [2025-11-13 12:01:51,811][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128 [2025-11-13 12:01:52,137][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128 [2025-11-13 12:01:52,464][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens. 
[2025-11-13 12:01:53,132][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11 [2025-11-13 12:01:53,837][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt [2025-11-13 12:01:53,838][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt [2025-11-13 12:01:53,840][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl [2025-11-13 12:01:54,986][__main__][INFO] - Iteration 608 took 23s (39.84% Gen, 55.32% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 50m 41s. Estimated total time: 19h 45m 47s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 31s, 500 more iterations: 3h 17m 37s. [2025-11-13 12:01:54,987][__main__][INFO] - Starting iteration 608. [2025-11-13 12:01:54,991][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 60 and human policies 1. 
[2025-11-13 12:01:54,992][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 12:02:04,503][__main__][INFO] - Number of regex retries in iteration 608: 0 [2025-11-13 12:02:04,504][__main__][INFO] - agents played in iteration 608 are Alice, Bob, Alice_buffer, Bob_buffer [2025-11-13 12:02:04,962][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:02:04,995][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:02:05,029][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:02:05,065][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:02:05,065][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 12:02:05,066][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 12:02:05,801][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 12:02:06,098][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 12:02:06,422][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 12:02:06,746][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 12:02:07,073][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 12:02:07,397][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 12:02:07,723][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 12:02:08,047][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 12:02:08,378][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 12:02:08,695][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 12:02:09,021][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 12:02:09,345][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 12:02:09,668][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 12:02:09,992][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 12:02:10,315][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 12:02:10,639][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 12:02:10,964][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 12:02:11,292][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 12:02:11,616][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 12:02:11,940][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 12:02:12,266][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 12:02:12,589][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 12:02:12,912][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 12:02:13,237][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 12:02:13,563][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 12:02:13,894][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 12:02:14,218][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 12:02:14,546][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 12:02:14,873][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 12:02:15,200][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 12:02:15,527][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 12:02:15,853][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 12:02:16,178][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 12:02:16,878][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 12:02:17,579][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 12:02:17,580][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 12:02:17,582][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 12:02:18,703][__main__][INFO] - Iteration 609 took 23s (40.11% Gen, 55.16% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 50m 7s. Estimated total time: 19h 45m 36s. Time estimates for 10 more iterations: 3m 57s, 100 more iterations: 39m 31s, 500 more iterations: 3h 17m 36s.
[2025-11-13 12:02:18,705][__main__][INFO] - Starting iteration 609.
[2025-11-13 12:02:18,708][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 60 and human policies 1.
[2025-11-13 12:02:18,708][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 12:02:28,262][__main__][INFO] - Number of regex retries in iteration 609: 0
[2025-11-13 12:02:28,262][__main__][INFO] - agents played in iteration 609 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 12:02:28,721][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:02:28,754][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:02:28,788][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:02:28,821][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:02:28,822][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 12:02:28,822][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 12:02:29,560][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 12:02:29,857][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 12:02:30,182][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 12:02:30,506][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 12:02:30,831][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 12:02:31,155][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 12:02:31,481][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 12:02:31,804][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 12:02:32,129][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 12:02:32,453][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 12:02:32,778][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 12:02:33,101][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 12:02:33,425][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 12:02:33,748][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 12:02:34,075][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 12:02:34,399][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 12:02:34,722][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 12:02:35,046][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 12:02:35,369][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 12:02:35,693][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 12:02:36,018][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 12:02:36,341][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 12:02:36,666][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 12:02:36,991][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 12:02:37,320][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 12:02:37,644][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 12:02:37,973][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 12:02:38,296][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 12:02:38,622][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 12:02:38,947][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 12:02:39,272][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 12:02:39,597][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 12:02:39,925][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 12:02:40,637][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 12:02:41,369][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 12:02:41,371][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 12:02:41,372][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 12:02:42,960][__main__][INFO] - Iteration 610 took 24s (39.39% Gen, 54.05% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 16m 47s. Estimated total time: 20h 12m 40s. Time estimates for 10 more iterations: 4m 2s, 100 more iterations: 40m 25s, 500 more iterations: 3h 22m 6s.
[2025-11-13 12:02:42,963][__main__][INFO] - Starting iteration 610.
[2025-11-13 12:02:42,966][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 60 and human policies 1.
[2025-11-13 12:02:42,967][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 12:02:52,230][__main__][INFO] - Number of regex retries in iteration 610: 0
[2025-11-13 12:02:52,230][__main__][INFO] - agents played in iteration 610 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 12:02:52,695][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:02:52,728][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:02:52,762][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:02:52,795][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:02:52,796][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 12:02:52,797][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 12:02:53,535][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 12:02:53,829][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 12:02:54,154][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 12:02:54,478][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 12:02:54,802][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 12:02:55,126][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 12:02:55,451][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 12:02:55,774][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 12:02:56,097][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 12:02:56,422][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 12:02:56,747][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 12:02:57,076][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 12:02:57,396][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 12:02:57,722][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 12:02:58,047][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 12:02:58,371][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 12:02:58,695][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 12:02:59,022][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 12:02:59,347][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 12:02:59,679][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 12:03:00,000][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 12:03:00,326][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 12:03:00,653][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 12:03:00,983][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 12:03:01,304][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 12:03:01,629][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 12:03:01,957][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 12:03:02,287][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 12:03:02,610][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 12:03:02,936][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 12:03:03,262][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 12:03:03,586][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 12:03:03,913][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 12:03:04,608][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 12:03:05,319][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 12:03:05,320][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 12:03:05,322][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 12:03:07,344][__main__][INFO] - Iteration 611 took 24s (37.99% Gen, 53.70% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 22m 37s. Estimated total time: 20h 18m 55s. Time estimates for 10 more iterations: 4m 3s, 100 more iterations: 40m 37s, 500 more iterations: 3h 23m 9s.
[2025-11-13 12:03:07,346][__main__][INFO] - Starting iteration 611.
[2025-11-13 12:03:07,348][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 61 and human policies 1.
[2025-11-13 12:03:07,349][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 12:03:17,424][__main__][INFO] - Number of regex retries in iteration 611: 0
[2025-11-13 12:03:17,425][__main__][INFO] - agents played in iteration 611 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 12:03:17,876][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:03:17,909][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:03:17,943][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:03:17,976][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:03:17,977][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 12:03:17,977][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 12:03:18,713][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 12:03:19,009][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 12:03:19,334][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 12:03:19,659][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 12:03:19,983][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 12:03:20,308][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 12:03:20,633][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 12:03:20,972][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 12:03:21,296][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 12:03:21,619][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 12:03:21,943][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 12:03:22,278][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 12:03:22,604][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 12:03:22,928][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 12:03:23,250][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 12:03:23,585][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 12:03:23,908][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 12:03:24,234][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 12:03:24,561][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 12:03:24,889][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 12:03:25,216][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 12:03:25,542][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 12:03:25,869][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 12:03:26,195][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 12:03:26,525][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 12:03:26,851][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 12:03:27,177][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 12:03:27,501][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 12:03:27,827][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 12:03:28,152][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 12:03:28,483][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 12:03:28,802][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 12:03:29,126][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 12:03:29,861][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 12:03:30,535][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 12:03:30,536][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 12:03:30,538][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 12:03:31,593][__main__][INFO] - Iteration 612 took 24s (41.55% Gen, 54.08% Train). Generation: 10s, Training: 13s. Estimated remaining time: 19h 15m 33s. Estimated total time: 20h 12m 15s. Time estimates for 10 more iterations: 4m 2s, 100 more iterations: 40m 24s, 500 more iterations: 3h 22m 2s.
[2025-11-13 12:03:31,595][__main__][INFO] - Starting iteration 612.
[2025-11-13 12:03:31,597][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 61 and human policies 1.
[2025-11-13 12:03:31,598][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 12:03:40,802][__main__][INFO] - Number of regex retries in iteration 612: 0
[2025-11-13 12:03:40,803][__main__][INFO] - agents played in iteration 612 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 12:03:41,256][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:03:41,290][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:03:41,323][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:03:41,356][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:03:41,357][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 12:03:41,357][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 12:03:42,081][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 12:03:42,377][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 12:03:42,703][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 12:03:43,026][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 12:03:43,352][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 12:03:43,680][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 12:03:44,004][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 12:03:44,328][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 12:03:44,651][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 12:03:44,975][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 12:03:45,298][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 12:03:45,622][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 12:03:45,946][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 12:03:46,269][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 12:03:46,595][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 12:03:46,920][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 12:03:47,243][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 12:03:47,571][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 12:03:47,896][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 12:03:48,220][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 12:03:48,549][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 12:03:48,873][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 12:03:49,197][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 12:03:49,521][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 12:03:49,848][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 12:03:50,171][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 12:03:50,498][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 12:03:50,827][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 12:03:51,155][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 12:03:51,479][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 12:03:51,804][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 12:03:52,132][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 12:03:52,463][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 12:03:53,173][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 12:03:53,854][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 12:03:53,856][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 12:03:53,857][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 12:03:54,711][__main__][INFO] - Iteration 613 took 23s (39.82% Gen, 56.48% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 18m 39s. Estimated total time: 19h 15m 44s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 31s, 500 more iterations: 3h 12m 37s.
[2025-11-13 12:03:54,713][__main__][INFO] - Starting iteration 613.
[2025-11-13 12:03:54,715][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 61 and human policies 1.
[2025-11-13 12:03:54,716][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 12:04:04,443][__main__][INFO] - Number of regex retries in iteration 613: 0
[2025-11-13 12:04:04,444][__main__][INFO] - agents played in iteration 613 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 12:04:04,892][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:04:04,925][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:04:04,959][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:04:04,992][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:04:04,992][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 12:04:04,993][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 12:04:05,719][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 12:04:06,013][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 12:04:06,336][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 12:04:06,661][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 12:04:06,986][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 12:04:07,312][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 12:04:07,638][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 12:04:07,964][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 12:04:08,287][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 12:04:08,610][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 12:04:08,934][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 12:04:09,257][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 12:04:09,580][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 12:04:09,903][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 12:04:10,227][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 12:04:10,551][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 12:04:10,876][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 12:04:11,201][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 12:04:11,525][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 12:04:11,850][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 12:04:12,179][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 12:04:12,505][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 12:04:12,828][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 12:04:13,153][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 12:04:13,477][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 12:04:13,802][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 12:04:14,127][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 12:04:14,450][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 12:04:14,776][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 12:04:15,100][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 12:04:15,426][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 12:04:15,752][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 12:04:16,078][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 12:04:16,792][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 12:04:17,501][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 12:04:17,502][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 12:04:17,504][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 12:04:18,600][__main__][INFO] - Iteration 614 took 23s (40.73% Gen, 54.67% Train). Generation: 9s, Training: 13s. Estimated remaining time: 18h 56m 47s. Estimated total time: 19h 54m 16s. Time estimates for 10 more iterations: 3m 58s, 100 more iterations: 39m 48s, 500 more iterations: 3h 19m 2s.
[2025-11-13 12:04:18,602][__main__][INFO] - Starting iteration 614.
[2025-11-13 12:04:18,605][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 61 and human policies 1.
[2025-11-13 12:04:18,606][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 12:04:27,834][__main__][INFO] - Number of regex retries in iteration 614: 0 [2025-11-13 12:04:27,835][__main__][INFO] - agents played in iteration 614 are Alice, Bob, Alice_buffer, Bob_buffer [2025-11-13 12:04:28,289][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:04:28,323][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:04:28,356][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:04:28,389][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 12:04:28,390][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 12:04:28,391][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 12:04:29,124][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 12:04:29,418][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 12:04:29,743][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 12:04:30,067][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 12:04:30,396][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 12:04:30,721][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 12:04:31,046][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 12:04:31,370][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 12:04:31,699][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 12:04:32,022][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 12:04:32,347][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 12:04:32,670][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 12:04:32,999][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 12:04:33,325][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 12:04:33,648][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 12:04:33,972][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 12:04:34,296][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 12:04:34,621][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 12:04:34,945][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 12:04:35,268][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 12:04:35,593][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 12:04:35,917][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 12:04:36,241][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 12:04:36,566][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 12:04:36,890][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 12:04:37,214][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 12:04:37,538][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 12:04:37,862][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 12:04:38,193][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 12:04:38,518][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 12:04:38,841][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 12:04:39,165][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 12:04:39,493][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 12:04:40,201][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 12:04:40,891][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 12:04:40,893][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 12:04:40,895][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 12:04:42,781][__main__][INFO] - Iteration 615 took 24s (38.17% Gen, 54.02% Train). Generation: 9s, Training: 13s. Estimated remaining time: 19h 10m 56s. Estimated total time: 20h 8m 49s. Time estimates for 10 more iterations: 4m 1s, 100 more iterations: 40m 17s, 500 more iterations: 3h 21m 28s.
[2025-11-13 12:04:42,783][__main__][INFO] - Starting iteration 615.
[2025-11-13 12:04:42,786][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 61 and human policies 1.
[2025-11-13 12:04:42,787][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 12:04:51,686][__main__][INFO] - Number of regex retries in iteration 615: 0
[2025-11-13 12:04:51,687][__main__][INFO] - agents played in iteration 615 are Alice, Bob, Alice_buffer, Bob_buffer
[2025-11-13 12:04:52,142][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:04:52,176][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:04:52,209][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:04:52,242][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 40.57%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 12:04:52,243][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 12:04:52,243][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 12:04:52,962][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 128
[2025-11-13 12:04:53,258][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 128
[2025-11-13 12:04:53,582][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 128
[2025-11-13 12:04:53,906][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 128
[2025-11-13 12:04:54,230][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 128
[2025-11-13 12:04:54,554][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 128
[2025-11-13 12:04:54,881][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 128
[2025-11-13 12:04:55,204][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 128
[2025-11-13 12:04:55,528][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 128
[2025-11-13 12:04:55,853][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 128
[2025-11-13 12:04:56,176][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 128
[2025-11-13 12:04:56,507][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 128
[2025-11-13 12:04:56,825][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 128
[2025-11-13 12:04:57,150][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 128
[2025-11-13 12:04:57,474][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 128
[2025-11-13 12:04:57,811][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 128
[2025-11-13 12:04:58,133][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 128
[2025-11-13 12:04:58,457][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 128
[2025-11-13 12:04:58,782][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 128
[2025-11-13 12:04:59,108][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 128
[2025-11-13 12:04:59,435][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 128
[2025-11-13 12:04:59,762][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 128
[2025-11-13 12:05:00,088][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 128
[2025-11-13 12:05:00,414][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 128
[2025-11-13 12:05:00,738][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 128
[2025-11-13 12:05:01,062][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 128
[2025-11-13 12:05:01,387][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 128
[2025-11-13 12:05:01,713][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 128
[2025-11-13 12:05:02,039][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 128
[2025-11-13 12:05:02,366][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 128
[2025-11-13 12:05:02,689][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 128
[2025-11-13 12:05:03,021][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 128
[2025-11-13 12:05:03,337][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 3840 tokens.
[2025-11-13 12:05:04,036][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 2.51%, Current % of VRAM taken: 41.82%, Block Peak % of device VRAM: 25.97%, ΔTime: 00:00:11
[2025-11-13 12:05:04,737][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/policy_optimizer_state.pt
[2025-11-13 12:05:04,738][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/critic_optimizer_state.pt
[2025-11-13 12:05:04,739][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/ipd_ad_align_nocurrtimestep_seed123_bs128/seed_123/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 12:05:05,934][__main__][INFO] - Iteration 616 took 23s (38.45% Gen, 56.39% Train). Generation: 8s, Training: 13s. Estimated remaining time: 18h 19m 9s. Estimated total time: 19h 17m 25s. Time estimates for 10 more iterations: 3m 51s, 100 more iterations: 38m 34s, 500 more iterations: 3h 12m 54s.
[2025-11-13 12:05:05,936][__main__][INFO] - Starting iteration 616.
[2025-11-13 12:05:05,939][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 61 and human policies 1.
[2025-11-13 12:05:05,939][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 12:05:17,432][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:17,444][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:17,445][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:17,449][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:17,449][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:17,450][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:17,577][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,577][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,577][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,577][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,577][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,577][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,577][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,578][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,578][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,578][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,578][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,578][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,578][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,578][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,578][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,579][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,579][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,579][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,579][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,579][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,579][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,579][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,579][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,580][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,580][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:17,580][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,580][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,580][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,580][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,580][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,581][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,581][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,581][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,581][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,581][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,581][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,581][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,581][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,582][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,582][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,582][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,582][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,582][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,582][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,582][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,582][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,583][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,583][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,583][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,583][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:17,583][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,583][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,583][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,583][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,584][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,584][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,584][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,584][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,584][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,584][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,584][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,584][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,585][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,585][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,585][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,585][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,585][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,585][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,585][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,585][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,586][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,586][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,586][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,586][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,586][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:17,586][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,586][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,587][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,587][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,587][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,587][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,587][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,587][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,587][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,588][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,588][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,588][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,588][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,588][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,588][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,588][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,588][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,589][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,589][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,589][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,589][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,589][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,589][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,589][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,589][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:17,590][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,590][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,590][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,590][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,590][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,590][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,590][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,590][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,591][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,591][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,591][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,591][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,591][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,591][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,591][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,591][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,592][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,592][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,592][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,592][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,592][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,592][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,592][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,592][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,593][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:17,593][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,593][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,593][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,593][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,593][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,593][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,593][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,594][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,594][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,594][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,594][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,594][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,594][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,594][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,594][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,595][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,595][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,595][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,595][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,595][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,595][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,595][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,595][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,596][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,596][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:17,596][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,596][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,596][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,596][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,596][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,596][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,597][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,597][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,597][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,597][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,597][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,597][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,597][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,597][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,598][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,598][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,598][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,598][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,598][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,598][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,598][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,598][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,599][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,599][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,599][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:17,599][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,599][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,599][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,599][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,599][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,600][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,600][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,600][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,600][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,600][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,600][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,600][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,600][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,601][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,601][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,601][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,601][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,601][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,601][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,601][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,602][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,602][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,602][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,602][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,602][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:17,602][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,602][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,602][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,603][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,603][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,603][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,603][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,603][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,603][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,603][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,603][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,604][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,604][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,604][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,604][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,604][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,604][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,604][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,605][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,605][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,605][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,605][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,605][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,605][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,605][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:17,605][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,606][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,606][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,606][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,606][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,606][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,606][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,606][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,606][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,607][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,607][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,607][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,607][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,607][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,607][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,607][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,608][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,608][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,608][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,608][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,608][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,608][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,608][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,608][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,609][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:17,609][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,609][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,609][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,609][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,609][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,609][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,610][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,610][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,610][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,610][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,610][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,610][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,610][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,610][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,611][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,611][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,611][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,611][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,611][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,611][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,611][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,611][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,612][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,612][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,612][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:17,612][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,612][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,612][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,612][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,613][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,613][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,613][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,613][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,613][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,613][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,613][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,613][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,614][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,614][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,614][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,614][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,614][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,614][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,614][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,614][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,615][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,615][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,615][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,615][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,615][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:17,615][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,615][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,616][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,616][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,616][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,616][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,616][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,616][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,616][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,616][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,617][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,617][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,617][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,617][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,617][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,617][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,617][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,617][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,618][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,618][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,618][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,618][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,618][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,618][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,618][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:17,619][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:17,764][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,764][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,764][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,764][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,764][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,765][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,765][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,765][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,765][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,765][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,765][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,765][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,765][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,766][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,766][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,766][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,766][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,766][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,766][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,766][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,767][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,767][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,767][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,767][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,767][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:17,767][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,767][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,767][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,768][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,768][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,768][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,768][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,768][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,768][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,768][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,769][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,769][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,769][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,769][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,769][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,769][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,769][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,769][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,770][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,770][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,770][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,770][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,770][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,770][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,770][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:17,770][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,771][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,771][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,771][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,771][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,771][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,771][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,771][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,772][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,772][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,772][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,772][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,772][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,772][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,772][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,772][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,773][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,773][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,773][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,773][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,773][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,773][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,773][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,774][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,774][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:17,774][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,774][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,774][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,774][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,774][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,774][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,775][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,775][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,775][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,775][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,775][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,775][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,775][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,776][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,776][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,776][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,776][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,776][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,776][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,776][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,777][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,777][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,777][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,777][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,777][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:17,777][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,777][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,777][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,778][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,778][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,778][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,778][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,778][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,778][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,778][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,778][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,779][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,779][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,779][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,779][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,779][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,779][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,779][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,780][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,780][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,780][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,780][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,780][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,780][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,780][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:17,780][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,781][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,781][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,781][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,781][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,781][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,781][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,781][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,782][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,782][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,782][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,782][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,782][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,782][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,782][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,782][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,783][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,783][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,783][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,783][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,783][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,783][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,783][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,783][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,784][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:17,784][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,784][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,784][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,784][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,784][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,784][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,785][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,785][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,785][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,785][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,785][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,785][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,785][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,785][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,786][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,786][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,786][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,786][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,786][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,786][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,786][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,786][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,787][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,787][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,787][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:17,787][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,787][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,787][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,787][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,787][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,788][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,788][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,788][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,788][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,788][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,788][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,788][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,789][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,789][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,789][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,789][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,789][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,789][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,789][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,789][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,790][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,790][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,790][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,790][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,790][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:17,790][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,790][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,790][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,791][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,791][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,791][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,791][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,791][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,791][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,791][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,791][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,792][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,792][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,792][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,792][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,792][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,792][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,792][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,792][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,793][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,793][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,793][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,793][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,793][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,793][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:17,793][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,794][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,794][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,794][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,794][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,794][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,794][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,794][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,794][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,795][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,795][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,795][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,795][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,795][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,795][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,795][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,796][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,796][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,796][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,796][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,796][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,796][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,796][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,796][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,797][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:17,797][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,797][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,797][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,797][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,797][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,797][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,797][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,798][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,798][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,798][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,798][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,798][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,798][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,798][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,798][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,799][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,799][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,799][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,799][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,799][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,799][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,799][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,799][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,800][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,800][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:17,800][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,800][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,800][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,800][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,800][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,801][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,801][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,801][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,801][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,801][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,801][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,801][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,801][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,802][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,802][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,802][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,802][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,802][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,802][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,802][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,802][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,803][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,803][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,803][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,803][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:17,803][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,803][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,803][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,804][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,804][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,804][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,804][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,804][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,804][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,804][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,804][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,805][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,805][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,805][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,805][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,805][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,805][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,805][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,806][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,806][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,806][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,806][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,806][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,806][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,806][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:17,806][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:17,876][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,876][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,876][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,876][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,876][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,876][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,876][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,877][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,877][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,877][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,877][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,877][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,877][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,877][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,878][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,878][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,878][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,878][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,878][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,878][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,878][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,878][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,879][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,879][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,879][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:17,879][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,879][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,879][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,879][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,879][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,880][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,880][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,880][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,880][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,880][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,880][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,880][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,881][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,881][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,881][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,881][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,881][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,881][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,881][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,881][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,882][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,882][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,882][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,882][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,882][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:17,882][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,882][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,882][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,883][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,883][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,883][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,883][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,883][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,883][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,883][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,884][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,884][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,884][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,884][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,884][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,884][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,884][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,884][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,885][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,885][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,885][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,885][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,885][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,885][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,885][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:17,886][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,886][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,886][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,886][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,886][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,886][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,886][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,886][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,887][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,887][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,887][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,887][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,887][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,887][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,887][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,888][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,888][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,888][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,888][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,888][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,888][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,888][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,888][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,889][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,889][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:17,889][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,889][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,889][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,889][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,889][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,889][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,890][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,890][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,890][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,890][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,890][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,890][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,890][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,891][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,891][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,891][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,891][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,891][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,891][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,891][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,892][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,892][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,892][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,892][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,892][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:17,892][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,892][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,892][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,893][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,893][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,893][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,893][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,893][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,893][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,893][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,893][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,894][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,894][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,894][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,894][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,894][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,894][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,894][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,895][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,895][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,895][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,895][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,895][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,895][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,895][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:17,895][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,896][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,896][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,896][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,896][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,896][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,896][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,896][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,896][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,897][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,897][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,897][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,897][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,897][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,897][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,897][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,898][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,898][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,898][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,898][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,898][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,898][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,898][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,898][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,899][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:17,899][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,899][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,899][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,899][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,899][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,899][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,900][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,900][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,900][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,900][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,900][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,900][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,900][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,900][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,901][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,901][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,901][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,901][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,901][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,901][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,901][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,901][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,902][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,902][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,902][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:17,902][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,902][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,902][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,902][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,903][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,903][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,903][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,903][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,903][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,903][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,903][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,903][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,904][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,904][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,904][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,904][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,904][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,904][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,904][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,905][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,905][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,905][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,905][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,905][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,905][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:17,905][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,905][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,906][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,906][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,906][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,906][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,906][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,906][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,906][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,907][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,907][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,907][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,907][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,907][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,907][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,907][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,907][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,908][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,908][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,908][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,908][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,908][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,908][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,908][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,909][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:17,909][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,909][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,909][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,909][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,909][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,909][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,909][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,910][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,910][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,910][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,910][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,910][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,910][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,910][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,910][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,911][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,911][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,911][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,911][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,911][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,911][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,911][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,912][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,912][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,912][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:17,912][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,912][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,912][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,912][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,912][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,913][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,913][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,913][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,913][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,913][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,913][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,913][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,914][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,914][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,914][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,914][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,914][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,914][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,914][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,914][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,915][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,915][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,915][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,915][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,915][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:17,915][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,915][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,915][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,916][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,916][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,916][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,916][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,916][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,916][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,916][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,917][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,917][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,917][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,917][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,917][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,917][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,917][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,917][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,918][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,918][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,918][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,918][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,918][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,918][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:17,918][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:17,919][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:18,140][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,140][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,140][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,140][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,140][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,141][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,141][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,141][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,141][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,141][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,141][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,141][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,141][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,142][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,142][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,142][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,142][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,142][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,142][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,142][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,143][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,143][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,143][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,143][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,143][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:18,143][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,143][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,143][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,144][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,144][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,144][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,144][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,144][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,144][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,144][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,144][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,145][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,145][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,145][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,145][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,145][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,145][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,145][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,146][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,146][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,146][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,146][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,146][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,146][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,146][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:18,146][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,147][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,147][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,147][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,147][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,147][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,147][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,147][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,148][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,148][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,148][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,148][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,148][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,148][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,148][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,148][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,149][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,149][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,149][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,149][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,149][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,149][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,149][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,149][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,150][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:18,150][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,150][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,150][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,150][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,150][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,150][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,151][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,151][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,151][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,151][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,151][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,151][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,151][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,151][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,152][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,152][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,152][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,152][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,152][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,152][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,152][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,153][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,153][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,153][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,153][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:18,153][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,153][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,153][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,153][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,154][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,154][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,154][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,154][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,154][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,154][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,154][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,155][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,155][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,155][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,155][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,155][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,155][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,155][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,156][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,156][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,156][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,156][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,156][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,156][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,156][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:18,157][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,157][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,157][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,157][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,157][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,157][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,157][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,157][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,158][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,158][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,158][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,158][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,158][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,158][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,158][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,159][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,159][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,159][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,159][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,159][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,159][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,159][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,159][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,160][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,160][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:18,160][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,160][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,160][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,160][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,160][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,160][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,161][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,161][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,161][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,161][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,161][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,161][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,161][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,162][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,162][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,162][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,162][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,162][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,162][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,162][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,162][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,163][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,163][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,163][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,163][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:18,163][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,163][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,163][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,164][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,164][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,164][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,164][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,164][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,164][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,164][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,164][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,165][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,165][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,165][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,165][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,165][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,165][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,165][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,165][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,166][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,166][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,166][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,166][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,166][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,166][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:18,166][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,167][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,167][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,167][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,167][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,167][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,167][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,167][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,167][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,168][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,168][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,168][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,168][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,168][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,168][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,168][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,169][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,169][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,169][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,169][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,169][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,169][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,169][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,169][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,170][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:18,170][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,170][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,170][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,170][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,170][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,170][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,170][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,171][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,171][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,171][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,171][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,171][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,171][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,171][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,172][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,172][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,172][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,172][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,172][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,172][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,172][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,172][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,173][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,173][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,173][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:18,173][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,173][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,173][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,173][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,173][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,174][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,174][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,174][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,174][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,174][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,174][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,174][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,175][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,175][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,175][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,175][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,175][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,175][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,175][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,175][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,176][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,176][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,176][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,176][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,176][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:18,176][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,176][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,177][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,177][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,177][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,177][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,177][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,177][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,177][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,177][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,178][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,178][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,178][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,178][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,178][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,178][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,178][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,179][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,179][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,179][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,179][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,179][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,179][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,179][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,179][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:18,180][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,180][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,180][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,180][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,180][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,180][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,180][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,180][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,181][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,181][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,181][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,181][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,181][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,181][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,181][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,182][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,182][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,182][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,182][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,182][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,182][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,182][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,182][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,183][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,183][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:18,183][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:18,252][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,252][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,253][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,253][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,253][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,253][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,253][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,253][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,253][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,253][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,254][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,254][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,254][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,254][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,254][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,254][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,254][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,254][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,255][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,255][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,255][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,255][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,255][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,255][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,255][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:18,256][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,256][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,256][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,256][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,256][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,256][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,256][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,256][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,257][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,257][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,257][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,257][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,257][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,257][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,257][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,258][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,258][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,258][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,258][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,258][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,258][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,258][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,258][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,259][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,259][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:18,259][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,259][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,259][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,259][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,259][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,260][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,260][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,260][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,260][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,260][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,260][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,260][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,260][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,261][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,261][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,261][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,261][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,261][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,261][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,261][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,261][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,262][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,262][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,262][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,262][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:18,262][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,262][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,262][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,263][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,263][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,263][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,263][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,263][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,263][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,263][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,263][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,264][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,264][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,264][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,264][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,264][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,264][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,264][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,265][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,265][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,265][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,265][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,265][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,265][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,265][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:18,265][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,266][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,266][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,266][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,266][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,266][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,266][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,266][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,267][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,267][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,267][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,267][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,267][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,267][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,267][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,267][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,268][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,268][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,268][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,268][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,268][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,268][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,268][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,269][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,269][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:18,269][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,269][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,269][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,269][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,269][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,269][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,270][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,270][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,270][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,270][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,270][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,270][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,270][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,270][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,271][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,271][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,271][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,271][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,271][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,271][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,271][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,272][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,272][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,272][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,272][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:18,272][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,272][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,272][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,272][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,273][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,273][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,273][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,273][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,273][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,273][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,273][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,274][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,274][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,274][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,274][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,274][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,274][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,274][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,274][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,275][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,275][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,275][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,275][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,275][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,275][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:18,275][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,276][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,276][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,276][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,276][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,276][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,276][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,276][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,276][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,277][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,277][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,277][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,277][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,277][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,277][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,277][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,277][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,278][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,278][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,278][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,278][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,278][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,278][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,278][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,279][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:18,279][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,279][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,279][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,279][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,279][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,279][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,279][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,280][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,280][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,280][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,280][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,280][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,280][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,280][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,281][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,281][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,281][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,281][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,281][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,281][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,281][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,281][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,282][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,282][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,282][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:18,282][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,282][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,282][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,282][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,282][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,283][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,283][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,283][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,283][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,283][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,283][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,283][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,284][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,284][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,284][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,284][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,284][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,284][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,284][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,284][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,285][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,285][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,285][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,285][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,285][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:18,285][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,285][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,286][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,286][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,286][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,286][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,286][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,286][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,286][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,287][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,287][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,287][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,287][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,287][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,287][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,287][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,287][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,288][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,288][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,288][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,288][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,288][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,288][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,288][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,288][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:18,289][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,289][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,289][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,289][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,289][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,289][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,289][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,290][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,290][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,290][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,290][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,290][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,290][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,290][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,290][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,291][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,291][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,291][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,291][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,291][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,291][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,291][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,292][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,292][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,292][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:18,292][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,292][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,292][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,292][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,292][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,293][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,293][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,293][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,293][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,293][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,293][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,293][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,294][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,294][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,294][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,294][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,294][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,294][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,294][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,294][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,295][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,295][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,295][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,295][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,295][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:18,295][asyncio][WARNING] - socket.send() raised exception.
[2025-11-13 12:05:18,365][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,365][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,365][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,365][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,365][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,365][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,365][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,366][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,366][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,366][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,366][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,366][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,366][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,366][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,366][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,367][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,367][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,367][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,367][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,367][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,367][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,367][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,368][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,368][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,368][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:18,368][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,368][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,368][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,368][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,368][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,369][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,369][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,369][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,369][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,369][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,369][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,369][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,370][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,370][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,370][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,370][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,370][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,370][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,370][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,370][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,371][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,371][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,371][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,371][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,371][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:18,371][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,371][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,372][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,372][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,372][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,372][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,372][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,372][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,372][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,372][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,373][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,373][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,373][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,373][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,373][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,373][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,373][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,374][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,374][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,374][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,374][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,374][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,374][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,374][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,374][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:18,375][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,375][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,375][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,375][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,375][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,375][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,375][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,376][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,376][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,376][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,376][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,376][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,376][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,376][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,376][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,377][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,377][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,377][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,377][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,377][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,377][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,377][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,378][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,378][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,378][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:18,378][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,378][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,378][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,378][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,378][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,379][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,379][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,379][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,379][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,379][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,379][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,379][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,379][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,380][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,380][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,380][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,380][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,380][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,380][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,380][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,381][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,381][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,381][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,381][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,381][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:18,381][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,381][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,381][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,382][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,382][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,382][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,382][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,382][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,382][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,382][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,383][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,383][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,383][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,383][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,383][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,383][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,383][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,383][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,384][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,384][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,384][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,384][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,384][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,384][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,384][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:18,385][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,385][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,385][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,385][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,385][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,385][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,385][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,385][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,386][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,386][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,386][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,386][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,386][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,386][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,386][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,386][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,387][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,387][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,387][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,387][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,387][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,387][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,387][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,388][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,388][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:18,388][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,388][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,388][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,388][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,388][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,388][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,389][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,389][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,389][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,389][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,389][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,389][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,389][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,390][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,390][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,390][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,390][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,390][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,390][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,391][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,391][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,391][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,391][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,391][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,391][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:18,391][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,391][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,392][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,392][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,392][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,392][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,392][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,392][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,392][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,392][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,393][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,393][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,393][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,393][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,393][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,393][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,393][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,394][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,394][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,394][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,394][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,394][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,394][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,394][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,394][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:18,395][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,395][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,395][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,395][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,395][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,395][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,395][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,395][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,396][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,396][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,396][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,396][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,396][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,396][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,396][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,397][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,397][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,397][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,397][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,397][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,397][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,397][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,397][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,398][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,398][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:18,398][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,398][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,398][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,398][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,398][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,399][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,399][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,399][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,399][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,399][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,399][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,399][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,399][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,400][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,400][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,400][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,400][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,400][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,400][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,400][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,400][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,401][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,401][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,401][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,401][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:18,401][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,401][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,401][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,402][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,402][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,402][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,402][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,402][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,402][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,402][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,402][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,403][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,403][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,403][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,403][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,403][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,403][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,403][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,404][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,404][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,404][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,404][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,404][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,404][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,404][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:18,404][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,405][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,405][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,405][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,405][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,405][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,405][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,405][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,406][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,406][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,406][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,406][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,406][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,406][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,406][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,406][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,407][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,407][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,407][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,407][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,407][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,407][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,407][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,407][asyncio][WARNING] - socket.send() raised exception. [2025-11-13 12:05:18,408][asyncio][WARNING] - socket.send() raised exception. 
[2025-11-13 12:05:18,408][asyncio][WARNING] - socket.send() raised exception.